Hadoop, Mainframe & Syncsort: Simply the Best, Part 2
In Part 1 of my interview with Arnie Farrelly, VP of Global Support and Services, we got some background on Arnie and discussed customers' motivations for getting mainframe data into Hadoop and the challenges they face. Now, let's find out how Syncsort helps, and why we built a new capability to move the data without modifying it from its original mainframe format.
A lot of tools can help you get data into Hadoop. Why is DMX-h “Simply the Best” for mainframe to Hadoop?
Well, I mentioned complex copybooks earlier; let me expand on that a bit. One of the primary types of data on the mainframe is variable-length data. Historically, mainframe storage and computation were very expensive, so everything needed to be highly condensed and efficient, saving bits and bytes. That's where condensed data types like packed decimal came from. There are a lot of issues with processing variable-length, condensed data in Hadoop. The mainframe has this concept of record descriptor words, which describe the length of each record and how to read it. When you move that type of data into Hadoop with other tools, all of it has to be unpacked, line-feed terminated, and expanded into human-readable text. That presents a lot of challenges. Once you've done that, of course, your copybook no longer maps to the data, which means no reliable metadata. Also, because the record descriptor word is needed to make sense of each record, the records are not splittable, so you can't distribute them across multiple mappers in a MapReduce job. With any other tool, mainframe variable-length data simply can't be processed in Hadoop.
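To make the two formats Arnie mentions concrete, here is a minimal Python sketch (not DMX-h's actual implementation, which is proprietary) of what parsing RDW-prefixed variable-length records and decoding a packed-decimal (COMP-3) field involves. The function names are illustrative, not part of any product API.

```python
import struct

def split_rdw_records(blob: bytes) -> list[bytes]:
    """Split a stream of variable-length records on their Record
    Descriptor Words (RDWs): a 2-byte big-endian length, which counts
    the 4-byte RDW itself, followed by 2 reserved bytes."""
    records = []
    offset = 0
    while offset < len(blob):
        (length,) = struct.unpack(">H", blob[offset:offset + 2])
        records.append(blob[offset + 4:offset + length])
        offset += length  # the next RDW starts right after this record
    return records

def unpack_comp3(field: bytes, scale: int = 0):
    """Decode a packed-decimal (COMP-3) field: two BCD digits per byte,
    with the low nibble of the final byte holding the sign."""
    digits = []
    for b in field[:-1]:
        digits.extend((b >> 4, b & 0x0F))
    digits.append(field[-1] >> 4)
    sign_nibble = field[-1] & 0x0F
    value = int("".join(str(d) for d in digits))
    if sign_nibble == 0x0D:  # 0xD marks negative; 0xC and 0xF are positive
        value = -value
    return value / 10 ** scale if scale else value
```

For example, the three bytes `0x12 0x34 0x5C` decode to 12345: the value 12345 packed into half the space of its text form, which is exactly the storage saving Arnie describes. Note also why such a stream isn't splittable: a reader dropped into the middle of the file has no way to tell a length prefix from record payload without scanning from the very first RDW.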
But Syncsort solved that?
We did. A feature in our newest version, 8.5, lets you move mainframe data onto a Hadoop cluster without modifying it from its original format, while still making it splittable so you can process it with Hadoop.
Learn why mainframe data is an essential part of your data hub and how DMX-h can help break through the common barriers for data access.
Are there advantages to that way of doing it, aside from being able to process the data in Hadoop?
For one thing, you don't have to explode variable-length data out to its maximum size, which bloats the data enormously and makes processing sluggish. It also makes a big difference for financial institutions, which have compliance requirements. Storing the data in Hadoop in exactly its original, unchanged format is just what they need: at least one copy of the data can be preserved, unmodified, for compliance. You can do that now. You don't have to convert the data in order to store or process it in Hadoop.
That sounds good.
Our customers, especially in financial services, are really excited about that. There's another aspect to this, too. A lot of tools can move mainframe data onto the cluster, but they do it on the edge node: they land all the data on the edge node, explode it out to maximum length, convert it to plain text right there, and then move it into the cluster. If your edge node isn't big enough to hold all that data and processing at once, that's a problem. We use the full power of the cluster to move and transform the data in parallel. So if you have to, say, convert your entire data set from EBCDIC to ASCII, you can distribute the conversion across the cluster instead of doing all the work on the edge node.
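The EBCDIC-to-ASCII step Arnie mentions parallelizes naturally because each record converts independently of every other. As a rough illustration (using Python's built-in `cp037` codec for US EBCDIC, and a hypothetical helper name, not a DMX-h API):

```python
def ebcdic_to_ascii(record: bytes) -> str:
    # cp037 is Python's standard codec for US/Canada EBCDIC; shops
    # using other code pages may need cp500, cp1047, and so on.
    return record.decode("cp037")

# Each record converts on its own, so in a MapReduce-style job this
# function can run as a map step on whichever node holds the data
# block, rather than funneling everything through a single edge node.
records = [b"\xC8\xC5\xD3\xD3\xD6"]  # EBCDIC bytes for "HELLO"
converted = [ebcdic_to_ascii(r) for r in records]
```

This is the shape of an "embarrassingly parallel" transformation: no record needs to see any other, so the cluster's full parallelism applies.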
Stay tuned for Part 3 of our blog series, where we talk about customer use cases and another new capability we built to support them.