Multivendor Event Management for Mainframe System Health
Birth, marriage, death. We speak of these events loosely in common usage. When people are asked when they were born, for example, most will answer with a year, whereas an electronic health record will express it down to the minute. Similarly, is “marriage” associated with a wedding day or with the issuance of a marriage license?
The correct answer — for software as well as for people — is clearly context-dependent.
Computing: Eventful, Sometimes Stressful
A longtime goal of systems management tools for mainframes has been to identify performance problems and, where possible, to proactively “heal” them. There has been undeniable progress, but systems configurations have become steadily more complex. In some ways, systems management tools aren’t keeping up.
To understand why, let’s return to the human “life event” context. The Holmes and Rahe stress scale lists 43 “life events” that are strongly correlated with illness. Key among them for adults are the death of a spouse, imprisonment, and personal injury. When treating a person’s illness, clinicians must consider these events, even though it can be difficult to specify the particular event parameters.
By analogy, SMF data serves as the mainframe’s electronic health record because of the operational intelligence it offers. Several vendors have developed sophisticated tools to process SMF and other data and to correlate it with other mainframe activities. The concept of an event is so fundamental that in the internet search world, it has been canonized in schema.org. For mainframe systems management, standardization is less concrete. Instead, solution paths are offered by mature products from IBM and others in its mainframe ecosystem.
Mainframe, Heal Thyself
“Data to Dashboard” was a tour of 30 mainframe SMF types. SMF records are used for tuning, optimization, audit and security. For example, type 110 records can be used for tuning CICS. Operational intelligence allows for predicting failures, improving scheduling, managing capacity, identifying problems with applications, and characterizing workloads. Recent releases of z/OS, according to IBM, offer more ways to improve performance, squeezing more bang for the buck.
Some tools have moved toward self-healing (see the Easy Tier discussion below), but most mainframe techs will operate using one or more of the commercial tools available.
Computer Associates Event Management and Automation
CA OPS/MVS provides a means to identify events via its Event Notification Facility (CAIENF). CAIENF collects data from z/OS as well as CICS, DB2 and even from Unix System Services (USS).
The automation part of “event management and automation” means first recognizing different types of events and how they are related to system configurations, networks and people. CA presented this concept in a compelling diagram that J. Morris, A. Kira and K. O’Quinn titled “The Event Management Pyramid.”
Event Management Pyramid (c) Computer Associates 2014
CA offers RESTful web interfaces for its product, which CA says enables “triggering of automation and event management and querying of gathered information from non-z/OS environments.” This is the sort of instrumentation that could be useful for mainframe DevOps.
BMC Event Visibility through TrueSight
Over at BMC, one Operational Intelligence tool is TrueSight, which BMC dubs its Operations Management product.
BMC also offers CMF Monitor, an alternative to IBM’s Resource Measurement Facility (RMF). CMF Monitor is part of the BMC MainView Monitoring suite, through which SMF data flows (shown below).
Even within these tools, event management can be a multivendor effort. Most often this takes the form of sharing the underlying event logs, but sometimes the tools themselves interoperate.
For example, Compuware’s Strobe can be launched from BMC’s MainView, as shown below. The combined resource allows for improved monitoring of DB2, COBOL, IMS, or MQ processes in z/OS.
BMC Cost Analyzer and Compuware Strobe Interop (via Compuware and Alan Radding’s DancingDinosaur.com)
Dynatrace, formerly Compuware APM, allows for organizing event data around call trees to help identify resource-heavy execution areas, such as might be visualized in a heat map. (The heat map concept is incorporated into IBM’s Easy Tier Heat Map Transfer utility, which automates the process of incorporating a workload’s I/O activity metrics into a system’s storage hierarchy, a capability that is increasingly significant as the use of flash memory increases.)
OSI Layer Seven is where Strobe, now part of Dynatrace APM for Mainframe (spun off from Compuware in 2014), is directing its light. Many of the levers available to systems administrators exist at the application layer, such as DB2.
Strobe has been integrated with BMC’s MainView, as shown below. Strobe can be launched from MainView explorer to profile measurement of a particular job.
Integration of Dynatrace Strobe with BMC MainView
Compuware’s Topaz Runtime Visualizer, updated late last year, allows for visualization of application performance with visibility into source code. Several visualizations in Topaz (shown below) could be used to pinpoint code-level events that can be attributed to SMF values.
Possible Code Event Visualizations Offered by Compuware Topaz
IBM Tivoli OMEGAMON
Among the event management capabilities in IBM’s own portfolio of monitoring tools is Tivoli OMEGAMON, which it acquired from Candle Corp. more than a decade ago. Wayne Bucek, in a 2011 post explaining how to forward alerts and closing events to third-party event managers from OMEGAMON XE, called “consolidated event management” a “top priority IT initiative.”
The approach suggested by Bucek involves policy automation, which enables, for example, reset events to be sent to third-party event managers when events in OMEGAMON XE are no longer true. This is accomplished using a drag-and-drop workflow editing tool.
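The open-then-reset behavior described above can be sketched in a few lines. This is only an illustration of the general pattern, not OMEGAMON's API or Bucek's implementation: the `Situation` and `EventForwarder` classes, the `HIGH_CPU` name, the `cpu_pct` metric, and the 90 percent threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class Situation:
    """Hypothetical stand-in for a monitoring 'situation': a named predicate."""
    name: str
    is_true: Callable[[Dict[str, float]], bool]


class EventForwarder:
    """Tracks which situations are open and records what would be forwarded."""

    def __init__(self, situations: List[Situation]) -> None:
        self.situations = situations
        self.open: Set[str] = set()
        # Each entry is (situation name, status) as sent to the event manager.
        self.sent: List[Tuple[str, str]] = []

    def evaluate(self, sample: Dict[str, float]) -> None:
        for situation in self.situations:
            firing = situation.is_true(sample)
            if firing and situation.name not in self.open:
                # Situation just became true: forward an open event once.
                self.open.add(situation.name)
                self.sent.append((situation.name, "OPEN"))
            elif not firing and situation.name in self.open:
                # Situation is no longer true: forward a reset event.
                self.open.remove(situation.name)
                self.sent.append((situation.name, "RESET"))


# Assumed threshold and metric name, for illustration only.
high_cpu = Situation("HIGH_CPU", lambda metrics: metrics["cpu_pct"] > 90.0)
forwarder = EventForwarder([high_cpu])
for cpu in (50.0, 95.0, 97.0, 60.0):
    forwarder.evaluate({"cpu_pct": cpu})
print(forwarder.sent)
```

Tracking open situations in a set means a condition that stays true across several samples produces one open event rather than a stream of duplicates, and the reset is sent only for situations that were actually opened.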
Tivoli OMEGAMON XE agents for System z encompass CICS, DB2, IMS, networks, messaging, storage, z/VM, and z/OS events. As shown in the diagram below, monitoring events from heterogeneous systems and applications is no longer the exception.
IBM Tivoli Event Monitoring Components (via IBM)
These resources allow users to choreograph the flow of events and to describe the types of “situations” in which event data will flow, as shown in this screen from the Tivoli Situation Editor.
IBM Tivoli Situation/Workflow Editor (via IBM)
Future Event Representations, Interop and Automation
New design patterns for detecting z/OS events and managing them — in real time — are expected to emerge.
One example was offered by IBM’s Nick Clayton at SHARE in 2014. In “Easy Tiering with DS8870,” Clayton discusses volume management enabled by monitoring individual 1 GB extents and collecting data on them. Easy Tier performs some neat performance management tricks, such as intra-tier auto rebalancing and proactively avoiding performance hot spots.
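To make extent-level monitoring concrete, here is a minimal sketch of building a heat map from I/O offsets and placing the hottest extents on a faster tier. It is not IBM's Easy Tier algorithm, which is not described in this article. The function names, the two-tier "flash"/"disk" model, and the sample offsets are assumptions made for illustration; only the 1 GB extent size comes from the talk.

```python
from collections import Counter
from typing import Dict, Iterable

EXTENT_BYTES = 1 << 30  # 1 GB extents, the granularity cited above


def build_heat_map(io_offsets: Iterable[int]) -> Counter:
    """Count I/O operations per extent index, given byte offsets of each I/O."""
    return Counter(offset // EXTENT_BYTES for offset in io_offsets)


def plan_tiers(heat: Counter, flash_slots: int) -> Dict[int, str]:
    """Assign the hottest extents to flash and the rest to disk (hypothetical policy)."""
    # Sort by descending I/O count; break ties by extent index for determinism.
    ranked = sorted(heat, key=lambda extent: (-heat[extent], extent))
    return {
        extent: ("flash" if rank < flash_slots else "disk")
        for rank, extent in enumerate(ranked)
    }


# Made-up sample: two I/Os land in extent 0, three in extent 5, one in extent 2.
offsets = [
    0,
    10,
    5 * EXTENT_BYTES,
    5 * EXTENT_BYTES + 4096,
    5 * EXTENT_BYTES + 8192,
    2 * EXTENT_BYTES,
]
heat = build_heat_map(offsets)
plan = plan_tiers(heat, flash_slots=1)
print(plan)
```

A real tiering engine would also weigh migration cost, recency, and intra-tier balance. The sketch shows only the core idea that per-extent activity counts can drive placement decisions.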
IBM’s Easy Tier testing, reported at Edge2014 and elsewhere, suggests that effective monitoring coupled with a properly designed tool can improve performance by as much as 3X. Improvements of this scale deserve the attention of tools vendors as well as data center and systems managers.
Future approaches incorporate more sophisticated techniques for collecting and organizing metadata. Ontology Management, mainframe software developed by SAS, allows developer-managers to standardize terminology across applications (it includes APIs to SAS Metadata Server, SharePoint, FAST, EMC Documentum, Endeca and others). Event ontologies such as the CIM OWL Ontology, originally developed by the Distributed Management Task Force and converted by the Dopsy Group to the Web Ontology Language, could represent a step forward.
Events previously thought to be “external” (facilities? weather?) will become better integrated into what the intelligence community has long referred to as situation awareness.
Workflow automation, such as that implemented in OMEGAMON, has to be part of the solution, perhaps more widely based on the Business Process Model and Notation (BPMN), a workflow standard. What to do when an event occurs — if any human intervention is needed — is the stuff of workflow management. But workflow automation has yet to take hold in a big way for mainframes. Will workflow automation become part of mainframe DevOps?
Yes. But wider adoption will take more than highly engineered event-management solutions that require specialists in system- and device-dependent development. In what Intel’s CEO referred to last year as the API-first world, IBM and its ecosystem of toolmakers will need to demonstrate still greater agility.
Big Data System Health Events
Big data is behind the approach taken in the Syncsort Ironstream-Splunk partnership. SMF data is decomposed in real time by Ironstream, then reconstructed in Splunk where it can be organized into meaningful event patterns.
Self-managing, self-healing mainframes are part of the reason that executives like Compuware CEO Chris O’Malley say “the mainframe is still as relevant as ever.” O’Malley said, “We’re trying to raise the consciousness of the CIO to look at the mainframe not as legacy but as something that can be evolved and advanced to give them competitive advantage.”
Maybe every data center needs a clinician ready to tackle the mainframe “life events” that are afflicting machine health.