3 Typical Technical Challenges During Hadoop Implementation

Overcoming Technical Challenges in Large Mainframe to Hadoop Implementations

In the past couple of years, we’ve seen tremendous growth in demand for Syncsort DMX-h from large enterprises looking to leverage their Mainframe data on Hadoop. Syncsort DMX-h is currently the recognized leader in bringing Mainframe data into Big Data platforms as part of a Hadoop implementation.

A large percentage of these customers have come to us after a recommendation from one of the Hadoop vendors or from a Systems Integrator, the hands-on people who really get the technical challenges of this type of integration. Some people might wonder why the leaders in this industry recommend Syncsort, rather than some of the giants in Data Integration. What kind of problems trip those giants up, and what does Syncsort have under the hood that makes the hands-on Hadoop professionals recommend it?

To get an idea of the challenges that Syncsort takes on, let’s look at some Hadoop implementation examples and the technical problems each one faced.

Use Case 1: How Mainframe Data Can Go Native on Hadoop

A large enterprise was masking sensitive data on the Mainframe to use in mainframe application testing. This used a lot of expensive CPU time. They were looking to save the cost of MIPS by doing the masking on Hadoop, then bringing the masked data back to the Mainframe.

DMX-h has a unique capability to move Mainframe data to be processed in Hadoop without doing any conversion. By preserving the original Mainframe EBCDIC encoding and data formats such as Packed-Decimal, DMX-h could process the files in Hadoop without any loss of precision or changes in data structure. The original COBOL copybook was used by DMX-h to understand the data, including Occurs Depending On and Redefines definitions. After masking sensitive parts on Hadoop, that same copybook still matched the data when it was moved back to the Mainframe.
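To make the idea of "no conversion" concrete, here is a minimal Python sketch of what working on the raw bytes can look like. It is not DMX-h code, and the record layout (a 10-byte EBCDIC name followed by a 3-byte Packed-Decimal balance), the helper names, and the use of code page 037 are all assumptions made for illustration. The point is that a text field can be masked in place, in EBCDIC, while the Packed-Decimal bytes are left untouched and can still be read exactly.

```python
from decimal import Decimal


def unpack_comp3(raw: bytes, scale: int = 0) -> Decimal:
    """Decode a COBOL COMP-3 (Packed-Decimal) field into an exact Decimal.

    Each byte holds two decimal digits, except the last byte, which holds
    one digit plus a sign nibble (0xD means negative; 0xC or 0xF positive).
    """
    digits = []
    for byte in raw[:-1]:
        digits.append(byte >> 4)
        digits.append(byte & 0x0F)
    digits.append(raw[-1] >> 4)
    value = int("".join(str(d) for d in digits))
    if raw[-1] & 0x0F == 0x0D:
        value = -value
    # scaleb shifts the decimal point without any floating-point rounding.
    return Decimal(value).scaleb(-scale)


def mask_ebcdic_text(record: bytes, start: int, length: int, fill: str = "X") -> bytes:
    """Overwrite a text field with EBCDIC fill characters, keeping record length."""
    masked = (fill * length).encode("cp037")  # cp037 is one common EBCDIC code page
    return record[:start] + masked + record[start + length:]


# Hypothetical layout: NAME PIC X(10), BALANCE PIC S9(3)V99 COMP-3 (3 bytes).
record = "JANE DOE  ".encode("cp037") + bytes([0x12, 0x34, 0x5C])
masked = mask_ebcdic_text(record, start=0, length=10)

balance_before = unpack_comp3(record[10:13], scale=2)
balance_after = unpack_comp3(masked[10:13], scale=2)
```

Because the masked record has the same length and field offsets as the original, the same copybook still describes it, which is the property the use case relies on when the data goes back to the Mainframe.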

DMX-h Use Case: Mainframe Application Test Data Management

Keeping the data in its original Mainframe format also helped with governance, compliance and auditing. There is no need to justify data changes when the data is simply moved to another processing system, not altered.

Use Case 2: Efficiently Combining Complex Mainframe Data with Diverse Data Sources for Big Data Analytics on Hadoop

Another common use case for Syncsort DMX-h is moving Mainframe data to Hadoop so it can be combined with other data sources and included in analytics run in MapReduce or Spark. We’ve seen plenty of situations in which a customer tried different solutions to ingest Mainframe data and hit roadblocks that stalled its Hadoop implementation, but where DMX-h handled the situation easily.

In one example, a customer had nested Occurs Depending On clauses in their COBOL copybooks. Their existing solution expanded each record to the maximum occurrence length. This greatly inflated the size of the data on Hadoop and made ingestion painfully slow. With DMX-h, the records were kept at their intended size and the ingestion proceeded at a much faster rate. The ingestion completed in under 10 minutes, as opposed to 4 hours with the existing solution.
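The size difference is easy to see with a small Python sketch. This is an illustration under simplifying assumptions, not a description of how DMX-h works internally: it uses a single (not nested) Occurs Depending On, a 2-byte binary count field, and made-up values for the item size and maximum occurrence count. Real Mainframe files also typically carry record descriptor words, which are omitted here.

```python
import struct

ITEM_SIZE = 8                  # hypothetical: bytes per repeating item
MAX_ITEMS = 50                 # hypothetical: OCCURS 0 TO 50 TIMES DEPENDING ON a count
COUNT = struct.Struct(">H")    # hypothetical: 2-byte big-endian count field


def read_odo_records(buf: bytes):
    """Yield (count, items) per record, advancing by each record's actual length."""
    offset = 0
    while offset < len(buf):
        (count,) = COUNT.unpack_from(buf, offset)
        offset += COUNT.size
        items = [
            buf[offset + i * ITEM_SIZE: offset + (i + 1) * ITEM_SIZE]
            for i in range(count)
        ]
        offset += count * ITEM_SIZE
        yield count, items


def expanded_size(counts) -> int:
    """Bytes needed if every record is padded out to MAX_ITEMS occurrences."""
    return len(counts) * (COUNT.size + MAX_ITEMS * ITEM_SIZE)


def build(counts) -> bytes:
    """Build sample variable-length records with zero-filled items."""
    return b"".join(COUNT.pack(c) + bytes(c * ITEM_SIZE) for c in counts)


counts = [1, 3, 0, 2]
data = build(counts)
parsed = list(read_odo_records(data))

actual_bytes = len(data)               # records kept at their intended size
padded_bytes = expanded_size(counts)   # records expanded to the maximum
```

With these made-up numbers, four records occupy 56 bytes at their intended size versus 1,608 bytes when padded to the maximum. With nested clauses the padding multiplies at each level, which is consistent with the blow-up the customer saw.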

In another example, the customer had VSAM data with many segments. Their existing solution read the VSAM file once for each segment, which took a lot of time and used up expensive processing on the Mainframe. If a VSAM file had, for instance, 5 segments, the other solution had to read that same file 5 times over. DMX-h reads the VSAM file only once, partitions the data by segment ID using a field in the copybook, and splits the data into separate segments so that each segment can be mapped by a different copybook for further processing.
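The single-pass approach can be sketched in a few lines of Python. Again, this is only an illustration of the general technique, not DMX-h itself: the function name, the position and width of the segment ID field, and the sample records are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def split_by_segment(
    records: Iterable[bytes], id_offset: int = 0, id_length: int = 2
) -> Dict[bytes, List[bytes]]:
    """Route each record to its segment in one pass over the input.

    The segment ID is read as raw bytes from a fixed position, so no
    character-set conversion is needed to partition the data.
    """
    segments: Dict[bytes, List[bytes]] = defaultdict(list)
    for rec in records:
        seg_id = rec[id_offset:id_offset + id_length]
        segments[seg_id].append(rec)
    return dict(segments)


def ebcdic(text: str) -> bytes:
    """Encode sample text with code page 037 (an assumed EBCDIC code page)."""
    return text.encode("cp037")


# Hypothetical file with three segment types, identified by the first 2 bytes.
sample = [
    ebcdic("01CUSTOMER A"),
    ebcdic("02ORDER 1001"),
    ebcdic("02ORDER 1002"),
    ebcdic("03PAYMENT 77"),
    ebcdic("01CUSTOMER B"),
]

segments = split_by_segment(sample)
segment_counts = {key.decode("cp037"): len(recs) for key, recs in segments.items()}
```

Each resulting group can then be parsed with the layout that matches its segment type. The input is traversed once regardless of how many segment types it contains, rather than once per segment.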

Use Case 3: Simplifying Mainframe Data Access

We have helped users who discovered that accessing Mainframe data isn’t as easy as accessing data in other formats. For example, they might have found that their tool doesn’t handle Packed-Decimal or LOW-VALUES in COBOL, or cannot transfer data securely from the Mainframe using Connect:Direct or FTPS.

Another big challenge during Hadoop implementation is getting the COBOL copybook to match the Mainframe data. The original COBOL developers may be long gone, and there is no one around who can fix the copybook. Our Professional Services team sees that all the time, and helps enterprises correct copybook problems so the value of the data can be fully realized.

In these and other practical situations, customers have told us they chose DMX-h as the faster, easier to use, and more cost-effective solution. Mainframe sources can be accessed and processed with DMX-h with just a few mouse clicks. The simplicity makes hard problems look easy.

We’ve spent a lot of time listening to our customers’ pains and aspirations for their Mainframe data. Our 40+ years of Mainframe and Data Integration expertise, combined with active involvement in the Apache Hadoop community, resulted in DMX-h’s strong Mainframe Access and Integration functionality. Another thing that sets us apart is our mature set of security features. Our ease of integration with Kerberos and easy handling of encrypted or compressed data make a huge difference in production implementations.

For some more specifics, read where Arnie Farrelly, VP of Global Support and Services, has recounted some of what his team has experienced when working with large enterprises trying to leverage their Mainframe data in Hadoop.

And here is a short video demonstrating how to use DMX-h to access Mainframe data and integrate it with Hadoop.

Authored by Fernanda Tavares

Vice President, Data Integration R&D
