Meeting the Challenge of Mainframe Data on Hadoop
Editor’s note: This post was originally published earlier this year on Keylink’s blog.
Experience and customer feedback tell us that working with mainframe data on Hadoop and Apache Spark isn’t easy, but the rich customer and transactional data sets stored on the mainframe are crucial to successful Big Data analytics and machine learning initiatives.
That’s why it’s time to liberate your mainframe data and drive better business insights.
What are the fundamental issues?
First, simply interpreting the mainframe data itself:
- Data Format – mainframe data is stored in EBCDIC format, while the preferred format for Hadoop is ASCII.
- Data Types – mainframes use a variety of binary data types such as packed decimal which need to be converted before use on Hadoop.
- Metadata – Cobol copybooks are typically used to define the layout of mainframe data files; these can be very complex and may contain logic such as nested OCCURS DEPENDING ON clauses.
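To make the first two issues concrete, here is a minimal Python sketch (illustrative only, not Syncsort’s implementation) of decoding EBCDIC text via the standard `cp037` codec and unpacking a COBOL packed-decimal (COMP-3) field. The sample byte strings are hypothetical:

```python
def decode_ebcdic(raw: bytes, codepage: str = "cp037") -> str:
    """Decode an EBCDIC byte string (here, IBM code page 037) to text."""
    return raw.decode(codepage)

def unpack_comp3(raw: bytes, scale: int = 0):
    """Decode a COBOL packed-decimal (COMP-3) field.

    Each byte holds two decimal digits (one per nibble); the final
    nibble is the sign: 0xC or 0xF = positive, 0xD = negative.
    """
    digits = []
    for b in raw:
        digits.append(b >> 4)
        digits.append(b & 0x0F)
    sign_nibble = digits.pop()  # last nibble carries the sign
    value = 0
    for d in digits:
        value = value * 10 + d
    if sign_nibble == 0x0D:
        value = -value
    return value / (10 ** scale) if scale else value

# "HELLO" in EBCDIC cp037, and 12345 stored as PIC S9(5) COMP-3
print(decode_ebcdic(b"\xC8\xC5\xD3\xD3\xD6"))  # HELLO
print(unpack_comp3(b"\x12\x34\x5C"))           # 12345
```

A real conversion also has to apply the copybook layout (field offsets, OCCURS clauses, variable-length records), which is where the complexity lies.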
Next, mainframes are renowned for their security, giving rise to legitimate concerns about Hadoop data security:
- In-flight – how to secure the transmission of data between the mainframe and Hadoop cluster?
- At-rest – how to secure access to data stored on Hadoop?
Finally, governance and compliance:
- How to preserve mainframe data in its native format on Hadoop?
- How to track metadata and data lineage for mainframe data on Hadoop?
There’s a simple solution
There’s only one tool that can handle these problems with ease and it comes from Syncsort – an industry leader in mainframe software for over 45 years, and now a well-established leader in the Hadoop ecosystem. Syncsort DMX-h bridges the gap between mainframe and Hadoop providing:
- Access – Built-in support for EBCDIC data, complex Cobol copybooks and mainframe record formats like VSAM, fixed, variable, packed decimal.
- Security – In-flight security with FTPS and Connect:Direct mainframe data transfer. At-rest security with Native LDAP and Kerberos authentication support, plus integration with Apache Sentry and Apache Ranger for authorisation and access control.
- Governance – Land mainframe data to HDFS in its native format – no need to stage translated copies of the data – then track metadata and data lineage with Cloudera Navigator.
There’s no need to hire or train new developers with specialised skills in Cobol, MapReduce or Spark. The DMX-h graphical development environment allows you to easily cleanse, blend and transform mainframe data with other legacy and Big Data sources for better business insights – no coding required. You can even ingest hundreds of mainframe DB2 database tables into Hadoop at one time with the DMX Data Funnel capability.
But wait, there’s more…
DMX-h already offers high-performance Change Data Capture (CDC) capability for Hadoop, but some customers would like to move only changed mainframe records across the network to Hadoop – rather than the whole data set after each update. That’s why Syncsort recently added the ability to do CDC directly on the mainframe – significantly reducing the volume of data that must be transferred across the network to Hadoop.
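To illustrate the idea behind CDC (this is a naive snapshot-diff sketch, not how DMX-h actually detects changes on the mainframe), compare two keyed snapshots and emit only the inserts, updates and deletes – the delta is typically a small fraction of the full data set:

```python
def capture_changes(previous: dict, current: dict):
    """Naive change-data-capture sketch: diff two snapshots of keyed
    records and return only the changes, not the whole data set."""
    changes = []
    for key, record in current.items():
        if key not in previous:
            changes.append(("insert", key, record))
        elif previous[key] != record:
            changes.append(("update", key, record))
    for key in previous:
        if key not in current:
            changes.append(("delete", key, None))
    return changes

# Hypothetical customer records keyed by account number
before = {1: "Alice,NY", 2: "Bob,LA", 3: "Carol,SF"}
after_ = {1: "Alice,NY", 2: "Bob,Chicago", 4: "Dave,TX"}
print(capture_changes(before, after_))
# Only three change records cross the network instead of the full file
```

Doing this detection at the source, as described above, means only the change records ever leave the mainframe.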
For more information, read Syncsort’s eBook: Mainframe Challenge: Unlocking the Value of Legacy Data