The Self-Managing Mainframe and How Tipping Points Surprise Us
The concept of a “Tipping Point” was introduced by Malcolm Gladwell in his book of the same name. Gladwell defines a tipping point as “the moment of critical mass, the threshold, the boiling point”. The idea is that enough little things add up to an irreversible change, after which things happen more quickly and more visibly. Similarly, the recent announcement from Compuware about their partnership with Syncsort is a small step that is part of a bigger trend. It takes us nearer to the tipping point, and the result may surprise us all. Let me explain.
Automating IT Operations
For years now IT has been talking about advanced automation using goal-seeking behavior. We have tried event-based automation, state-based automation, policy-based systems and autonomic computing; all attempts to reach the goal of a self-managing mainframe. IBM has often led the charge in these areas, at least in talking about them! Their current approach is labelled “Cognitive Computing” and grew out of enthusiasm for the Watson technology, which has had some success and could now be applied to IT management scenarios.
The goal is always to reduce the cost and risk of manual interventions. Given that computers can fly planes, it does seem reasonable that computers should be able to operate computers. Reasonable, yes, but so far out of reach for us in the mainframe world! Although we do see “lights out data centers”, these are usually controlled remotely by humans more than by self-correcting automation.
The Trouble with Tribbles
The 44th episode of Star Trek was called “The Trouble with Tribbles” and is apparently one of the most watched episodes of the series. While the Tribbles were annoying in many ways, the root of the trouble was the rate at which they multiplied, which is the exact opposite of the trouble the IT industry has with the skills needed to manage the mainframe base into the future.
The Boomer-heavy demographics of the mainframe workforce have been viewed as a problem for over a decade now, but we have yet to address the problem effectively. The big software companies all have their graduate training initiatives, and most companies using mainframes have been cross-training younger staff too. But the root issue is that mainframes are complex systems that are not easily understood at the level needed to manage them when things start to go wrong. If you lack the required experience, the problems can multiply faster than your ability to fix them.
The Soul of the Machine
Onto the scene comes Splunk, the market leader in machine learning from IT log data. This now seems to be the right answer both to how we automate complex, interconnected systems and to how we pass the experience and knowledge of the Boomers on to the Millennial IT generation.
In practice Machine Learning is very different from what we usually think of as Artificial Intelligence. AI seeks to build computer models that can emulate the functions of human brains. We expect that an AI would perceive its environment and exhibit goal-seeking, purposeful behavior that is understood by humans. Ideally it would interact with humans both to receive input and to augment our decision-making abilities.
By contrast, Machine Learning is a sub-area of AI focused on pattern recognition that allows a system to “learn” and predict based on history, but without there being a rational explanation for the response that a human could understand. Machine Learning relies on the consumption of masses of granular data that can be processed with statistical analysis to make predictions and uncover “hidden insights” about relationships and trends. But these “insights” are not necessarily causalities with an explanation that humans could understand and replicate.
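To make that distinction concrete, here is a minimal Python sketch (with made-up numbers, nothing to do with any particular product) of the kind of statistical learning involved: build a baseline from history, then flag values that don't fit the pattern. The model predicts, but offers no causal story about why a value is anomalous.

```python
import statistics

def learn_baseline(history):
    """Learn a simple statistical baseline (mean and standard deviation)
    from historical metric values, e.g. transactions per minute."""
    return statistics.mean(history), statistics.stdev(history)

def is_anomaly(value, mean, stdev, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean.
    A prediction with no human-readable explanation attached."""
    return abs(value - mean) > threshold * stdev

# Hypothetical historical CPU-utilization samples
history = [41, 43, 40, 44, 42, 43, 41, 42]
mean, stdev = learn_baseline(history)

print(is_anomaly(42, mean, stdev))  # typical value -> False
print(is_anomaly(95, mean, stdev))  # outlier -> True
```

Real tools use far more sophisticated models, of course, but the character is the same: the system learns from masses of history, not from rules a human wrote down.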
Using machine learning, Splunk apps can peer into the soul of a mainframe in a way that point management tools can't. Correlating data from a broad range of sources, both on the mainframe and from other platforms, allows the user to see the full context and interaction of events around a problem. Eventually this can lead to the self-healing automation we have dreamed of.
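As a toy illustration of that cross-platform correlation (plain Python with hypothetical events, not Splunk's actual search language), imagine normalizing events from several platforms onto one timeline and pulling everything that falls near a problem:

```python
from datetime import datetime, timedelta

# Hypothetical events from several sources, normalized to (timestamp, source, message)
events = [
    (datetime(2017, 5, 1, 9, 59, 50), "linux-web",   "upstream timeout"),
    (datetime(2017, 5, 1, 10, 0, 2),  "z/OS SYSLOG", "CICS region short on storage"),
    (datetime(2017, 5, 1, 10, 0, 5),  "z/OS SMF",    "transaction PAY1 abended"),
    (datetime(2017, 5, 1, 14, 30, 0), "linux-web",   "routine restart"),
]

def correlate(events, anchor_time, window_seconds=30):
    """Return every event, from any platform, that falls within a time
    window around an anchor event -- the cross-source 'context'."""
    window = timedelta(seconds=window_seconds)
    return [e for e in events if abs(e[0] - anchor_time) <= window]

# Context around the abend: the web timeout and the storage message show up too
for ts, source, msg in correlate(events, datetime(2017, 5, 1, 10, 0, 5)):
    print(ts, source, msg)
```

The value is precisely that the distributed-side timeout and the mainframe-side abend land in the same view, which no single point tool would show you.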
Splunking the Mainframe
Syncsort has been helping people Splunk their mainframes for over two years now, and we have learned a lot along the way. Our Ironstream product can access machine data from almost any known source on the mainframe, and there are gobs of it! The data is transformed to be ready for ingestion by Splunk, and users can filter what they consume to keep their ingestion-based costs under control.
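The filtering idea can be sketched in a few lines of Python. The record layout and policy below are purely illustrative, not Ironstream's actual configuration, but they show the principle: forward only the sources and severities you need, because ingested volume is what drives cost.

```python
# Hypothetical mainframe log records, one dict per record after transformation
records = [
    {"source": "SYSLOG", "severity": "INFO",  "text": "job started"},
    {"source": "SYSLOG", "severity": "ERROR", "text": "dataset allocation failed"},
    {"source": "RACF",   "severity": "WARN",  "text": "access violation"},
    {"source": "SMF",    "severity": "INFO",  "text": "interval record"},
]

# A user-defined filter policy: only forward what the team actually needs,
# since Splunk licensing is typically priced on ingested data volume.
WANTED_SOURCES = {"SYSLOG", "RACF"}
WANTED_SEVERITIES = {"WARN", "ERROR"}

def should_forward(record):
    """Decide whether a record is worth paying to ingest."""
    return (record["source"] in WANTED_SOURCES
            and record["severity"] in WANTED_SEVERITIES)

forwarded = [r for r in records if should_forward(r)]
print(len(forwarded))  # 2 of 4 records survive the filter
```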
The users of Splunk solutions are generally Enterprise IT teams using data from as many platforms as possible in the quest for an end-to-end view of activity. There are several generalized use cases, from security and compliance to IT operations and capacity planning. But the power of the Splunk Enterprise platform is such that each customer can very easily build the solution they need.
With the addition of mainframe log data, the Enterprise teams truly have the landscape covered and can visualize the key business services (e.g. online banking) with end-to-end monitoring. MF IT is beginning to see the power of this too, now that their data is going into the pot.
Compuware’s first toe in the water with Splunk involves the ingestion of SMF records cut by Abend-Aid when an application or other z/OS program abnormally terminates. The data represents critical fault management information that can be integrated into the DevOps cycle of continual improvement. With this data in Splunk, users will have a historical record of faults with failure codes, causes and details about the code that has failed, including the last compile date. Special information is added when a CICS transaction abends, recording the transaction ID and caller details.
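To show the kind of question such a historical record answers, here is a small Python sketch. The field names (abend_code, tran_id and so on) are hypothetical stand-ins, not the actual SMF record layout, but the analysis is the sort a DevOps team would run over the accumulated faults.

```python
from collections import Counter

# Hypothetical fault records as they might look after abend data is
# transformed for Splunk; field names here are illustrative only.
faults = [
    {"abend_code": "S0C7", "program": "PAYROLL1", "tran_id": "PAY1"},
    {"abend_code": "S0C4", "program": "BILLING2", "tran_id": "BIL9"},
    {"abend_code": "S0C7", "program": "PAYROLL1", "tran_id": "PAY1"},
    {"abend_code": "S013", "program": "REPORTX",  "tran_id": None},
]

# A historical view: which failure codes recur, and in which programs --
# the raw material for the continual-improvement loop.
by_code = Counter(f["abend_code"] for f in faults)
by_program = Counter(f["program"] for f in faults)

print(by_code.most_common(1))     # [('S0C7', 2)]
print(by_program.most_common(1))  # [('PAYROLL1', 2)]
```

A recurring S0C7 (a data exception) against the same program is exactly the kind of pattern that should feed back into the development cycle rather than just being fixed operationally each time.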
Having talked with CEO Chris O’Malley about Compuware’s strategy, I expect to see more Splunk-based offerings coming from them, and Syncsort is certainly committed to helping with these plans.
New tools for the next generation
For at least 10 years the incumbent tool providers have talked about modernizing mainframe management. IBM has tried with its Tivoli strategy and CA Technologies tried with its now abandoned Chorus strategy. Numerous vendors slapped browser-based UIs on old technology.
It’s easier with hindsight to see why these tools failed. The “old guard” mainframe Boomers simply prefer the 3270 screens they grew up with; with their incredible experience and years of finger-picking shortcuts, they can work faster there. Those new to the mainframe could rarely do their whole job on the new UIs, which also didn’t simplify the task much anyway. The platform was still hard to learn, and the new UIs actually ensured the newcomers worked slower than their mentors. Bizarrely, the most successful next-gen mainframe workers were those who embraced 3270, copying those who trained them.
The promise of analytics-based management tools takes a giant leap ahead and bypasses these points of failure. The new UI that modern IT workers expect is there, but the real point is that the task itself is transformed.
As the first industry solutions emerge, we see that analytics-based solutions complement the existing point management solutions more than they replace them. By gathering broad data to provide rich contextualization, they will be useful to Mainframe IT, old guard and new workers alike. Compuware’s new free app that will collect and visualize application fault data gathered from Abend-Aid is an example of this. Bundled with the Ironstream product, Syncsort offers many other free visualizations of data sources like SYSLOG and RACF access violations.
As these next-gen tools mature, progressive IT departments will tend to unify the traditional mainframe/distributed split, and the MF IT teams will start to request functions that replace the old tools. They won’t replace them screen for screen and function for function; rather, they will make functions obsolete, because analytics-based automation will increasingly make manual observation and intervention unnecessary and probably counterproductive.
I see a parallel with self-driving cars. Today they seem a bit fantastical and hard to trust. But I suspect adoption will be slow at first, until a tipping point when everyone realizes the technology has matured to the point that it is safer to keep people away from the controls. When the tipping point is reached for this next-gen mainframe automation, I suspect the analytics will be moved back to run on the Z platform somewhere. For a system as dense and powerful as a mainframe, you want critical automation on platform. But for now the early tools will be where the action is, and that’s on distributed Linux or in the cloud.
Similar in some ways to a “Tipping Point” is the “Paradigm Shift” described by the American philosopher of science Thomas Kuhn, although Kuhn was describing major transitions in scientific frameworks. As a paradigm shift gets under way there is always resistance from the “old guard”, until they are eventually overcome and align themselves with the new order.
I am optimistic that MF IT teams will undergo a natural evolution of acceptance as the value is demonstrated. I suspect that the real resistance might be from the incumbent tools vendors who have not prepared for the changes and find that it impacts their business models. Under O’Malley’s leadership, Compuware is clearly not one of these, but rather is seeking to be on the front-end, leading the way in applying analytics to a concept of DevOps that bridges mainframe to mobile.
Similarly, Syncsort seeks to play its role in the Digital Transformation by helping customers and vendors through the difficult transitions of the data-driven economy. If you have a mainframe, let us know when you are ready for the next steps you need to take. Or experiment on your own by downloading a copy of Ironstream, which is free to use with Syslog and Abend-Aid data. Best to get started soon: the self-managing mainframe will be upon us sooner than you think.