As Hadoop takes its place as a data processing platform, one of the major challenges is data ingestion. Enterprise applications need to integrate data from many sources, and these sources often reside on multiple platforms.
Many of our customers have business applications integrating data from Mainframe and databases like DB2 and Teradata, where data is prepared and aggregated in Hadoop and loaded into an analytics platform. Though the number and variety of sources may differ, this is a repeating pattern we see with our customers.
Syncsort’s Hadoop products simplify this potentially very complex step. Through Syncsort’s graphical user interface, users can specify a variety of data sources, whether Mainframe, Teradata, Salesforce or others, along with their metadata, such as COBOL copybooks and table schemas. This abstraction of platforms and data sources becomes a very powerful tool for the data scientist who doesn’t want to spend time getting the different data sets to Hadoop, and instead wants to focus on the analytics and get more insight from the data once it is in Hadoop.
However, it doesn’t end there. Syncsort’s unique value proposition comes from the ‘native’ integration of its Hadoop products: the data is processed within Syncsort’s engine, not by generated code in the form of a Hive query or a Pig flow. Syncsort has been a contributor to the Apache Hadoop project to offer solutions that can best serve its customers’ needs while leveraging Hadoop as the data processing platform. Syncsort’s engine runs ‘IN’ Hadoop as part of the native data processing flow, and it can easily integrate into a framework to move data in parallel. For example, it can access data from each of the Mappers in a Sqoop-like manner, reading partitioned data sets in parallel from a relational database and moving them to HDFS. Likewise, it can write to an MPP database from each of the Reducers, pushing data in parallel.
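To make the idea of reading a relational table in parallel more concrete, here is a minimal Python sketch of range-based splitting, where each mapper is handed one contiguous slice of a numeric key. The function name `range_splits` and its parameters are hypothetical and chosen for illustration only; this is not Syncsort’s or Sqoop’s actual code.

```python
def range_splits(lo, hi, num_mappers):
    """Split the inclusive key range [lo, hi] into contiguous sub-ranges.

    Illustrative only: assumes an integer split column whose minimum (lo)
    and maximum (hi) values are already known.
    """
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    if hi < lo:
        return []
    total = hi - lo + 1
    # Never create more splits than there are key values.
    parts = min(num_mappers, total)
    base, extra = divmod(total, parts)
    splits = []
    start = lo
    for i in range(parts):
        # The first `extra` splits take one additional key each.
        size = base + (1 if i < extra else 0)
        splits.append((start, start + size - 1))
        start += size
    return splits
```

Each mapper could then read only the rows whose key falls inside its own range, so no two mappers fetch the same row.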
We repeatedly come across the use case of moving multiple data sets from legacy Mainframe systems to Hadoop, and we feel this is a great opportunity to contribute to the Apache Hadoop project and share Syncsort’s Mainframe expertise.
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases. Syncsort submitted a patch (SQOOP-1272) to extend Sqoop for transferring data from Mainframe to Hadoop, allowing multiple Mainframe data sets to be moved to HDFS in parallel. Each data set will be stored as a separate HDFS file, and EBCDIC-encoded fixed-length data will be stored as ASCII-encoded variable-length text on HDFS.
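As a rough illustration of what turning EBCDIC fixed-length records into variable-length text involves, here is a small Python sketch. It assumes the data uses the `cp037` code page, that every record has the same known length, and that trailing blanks are trimmed to make lines variable length. These are assumptions for the example, not details taken from the SQOOP-1272 patch, and the function name is hypothetical.

```python
def ebcdic_fixed_to_text(data: bytes, record_length: int,
                         codepage: str = "cp037") -> str:
    """Decode fixed-length EBCDIC records into newline-terminated text.

    Assumptions (not from the patch): code page cp037, and trailing
    padding blanks are dropped from each record.
    """
    if record_length < 1:
        raise ValueError("record_length must be at least 1")
    if len(data) % record_length != 0:
        raise ValueError("data length is not a multiple of record_length")
    lines = []
    for offset in range(0, len(data), record_length):
        record = data[offset:offset + record_length]
        # Decode one record, then strip the blank padding on the right.
        lines.append(record.decode(codepage).rstrip(" "))
    return "".join(line + "\n" for line in lines)
```

For instance, the four bytes `C1 C2 40 40` with a record length of 4 decode to the line `AB`, because `C1` and `C2` are the letters A and B in this code page and `40` is the blank.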
Parallelism comes into play when one wants to move multiple data sets, such as the members of a Partitioned Data Set (PDS) on z/OS, to HDFS. Since there is no natural way to split a single mainframe data set, the work is divided by data set. One use case for this is moving IBM DB2 dump files from several tables on the Mainframe.
The open source contribution will provide an implementation for the new Mainframe Import tool. This implementation allows the user to specify a directory on the mainframe and move all the files located in this directory to Hadoop in parallel. The files can be stored in any format supported by Sqoop. The details of the command line options are included in the SQOOP-1272 design overview. The user can control the level of parallelism with the existing Sqoop option for the number of mappers.
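One plausible way to spread whole data sets across a chosen number of mappers is a simple round-robin assignment, sketched below in Python. The function name and the sample data set names are hypothetical, and the actual patch may distribute work differently.

```python
def assign_datasets(datasets, num_mappers):
    """Distribute data set names across mappers in round-robin order.

    Illustrative only: each data set stays whole and is handled by
    exactly one mapper.
    """
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    # There is no point in starting more mappers than data sets.
    workers = min(num_mappers, len(datasets))
    groups = [[] for _ in range(workers)]
    for index, name in enumerate(datasets):
        groups[index % workers].append(name)
    return groups
```

With five data sets and two mappers, the first mapper would take the first, third and fifth data sets and the second mapper the other two. Asking for more mappers than there are data sets gains nothing in this scheme, since a data set is never split.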
We are also excited to announce a more advanced implementation of this interface, with support for all Mainframe (z/OS) record formats, the ability to specify COBOL Copybook metadata, and support for VSAM file formats. Syncsort’s plug-in provides a feature-rich version supporting all data types, such as Packed-Decimal, and translation of EBCDIC-encoded fixed-length binary data to ASCII-encoded variable-length text in HDFS. It also simplifies archiving of Mainframe data in Hadoop in any Sqoop-supported data format. For example, one can transfer the original Mainframe data into Hadoop without any translation, addressing the use case of treating Hadoop as a data lake and keeping Mainframe data in its original format for compliance reasons.
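To show why a data type like Packed-Decimal needs dedicated handling, here is a minimal Python sketch that decodes a packed-decimal (COMP-3) field. Each byte holds two decimal digits, except the last, whose low half holds the sign. The function name is hypothetical, and the `scale` parameter stands in for the number of decimal places that a COBOL copybook would normally supply. This is not Syncsort’s implementation.

```python
from decimal import Decimal


def unpack_comp3(raw: bytes, scale: int = 0) -> Decimal:
    """Decode a packed-decimal field into an exact Decimal value.

    Illustrative only: `scale` is the implied number of decimal places.
    """
    if not raw:
        raise ValueError("packed-decimal field must not be empty")
    digits = []
    for byte in raw[:-1]:
        digits.append(byte >> 4)    # high nibble is one digit
        digits.append(byte & 0x0F)  # low nibble is the next digit
    digits.append(raw[-1] >> 4)     # last byte: high nibble is a digit
    sign_nibble = raw[-1] & 0x0F    # last byte: low nibble is the sign
    if any(digit > 9 for digit in digits):
        raise ValueError("invalid digit nibble")
    if sign_nibble < 0x0A:
        raise ValueError("invalid sign nibble")
    # 0xB and 0xD mark negative values; 0xA, 0xC, 0xE and 0xF are positive.
    sign = 1 if sign_nibble in (0x0B, 0x0D) else 0
    return Decimal((sign, tuple(digits), -scale))
```

For example, the three bytes `12 34 5C` with two implied decimal places decode to 123.45, and `12 34 5D` decodes to -123.45.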
For a demo of this functionality, visit us at Kiosk G2 at Hadoop Summit!