Big Data Governance: Bridging the Gap between Mainframe and Apache Hadoop
This blog was originally posted on the Cloudera VISION Blog
As Apache Hadoop celebrates its 10th birthday this year, it has become the central component of the next-generation data architecture. Many of the world’s largest organizations run several production workloads on Hadoop to power new revenue-generating applications, stay competitive and relevant in their industries, and become more agile and efficient. As enterprise adoption grew, so did the requirements for security and compliance.
Last year, Syncsort joined Cloudera to provide a unified foundation for open metadata and end-to-end visibility for governance. We helped our joint customers to secure and govern their data and meet regulatory compliance requirements with solutions leveraging Syncsort’s big data integration product, Syncsort DMX-h, tightly integrated with Cloudera Enterprise Data Hub (EDH), Cloudera Manager, Apache Sentry, and Cloudera Navigator.
Many of our joint customers are in banking, financial services, and healthcare. These industries have two things in common: they are heavily regulated and they rely heavily on mainframes. Unfortunately, accessing and integrating mainframe data with Hadoop in a way that also meets compliance requirements is extremely challenging. So, we saw a great opportunity to help. But before we get to that, you might be wondering how significant mainframes really are in the age of IoT and streaming data. So, let’s look at some data points:
- 70-80% of the world’s data either originates or is stored on mainframes
- IBM z13 system can process up to 2.5 billion transactions per day
- 71% of Fortune 500 companies have mainframes
The significance is even more apparent in our daily lives. Every time you swipe your credit card, you are accessing a mainframe; every time you make a payment with your mobile phone, you are accessing a mainframe; and of course, your social security checks are generated based on data on mainframes.
New data sources are easily captured in modern enterprise data hubs, but businesses also need to reference customer or transaction history data to make sense of these newer sources. Sensor or mobile data streamed through Apache Kafka still needs to be enriched and integrated with transaction history or customer reference data, which is often stored on mainframes and in legacy databases.
If we leave these critical data assets outside of big data analytics platforms and exclude them from the enterprise data hub, it is a missed opportunity. Making these data assets available for predictive and advanced analytics with Apache Spark opens up new business opportunities and significantly increases business agility.
From our experience in customer engagements around the world, we know this is easier said than done. This is a complex process, fraught with governance and compliance challenges. As mentioned above, some of the most promising data analytics insights and initiatives happen to be taking place in highly regulated industries. To use data such as personal health records or financial transactions for advanced analytics, enterprises must be able to access it securely, maintain and archive a copy in its original mainframe file format, and track where the data has been.
Breaking down data silos also brings data governance challenges. Security and lineage become critical for cross-platform data access. To address the data governance and lineage requirements, Cloudera introduced Cloudera Navigator, the leading Hadoop-based metadata management solution, over three years ago. Thanks to Syncsort DMX-h’s open source contributions and native integration in Hadoop, it integrates seamlessly with Cloudera Navigator, allowing users to search for DMX-h jobs across a unified metadata repository and view data lineage within the Cloudera Navigator user interface.
Syncsort DMX-h was one of the first data integration products to be certified on Cloudera Navigator and Apache Sentry. With it, our joint customers can easily get end-to-end data lineage across platforms while accessing and processing their mainframe data in Hadoop or Spark, on premises or in the cloud. DMX-h securely accesses mainframe data, even in its original EBCDIC format, and makes it available for processing in CDH like any other data source. Data scientists do not need to understand mainframe data formats and can focus on business insights. Syncsort DMX-h can bring data from hundreds of VSAM and sequential files, or from databases such as DB2/z and IMS, into Hadoop. It can also map complex COBOL copybook metadata to the Hive metastore automatically.
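To give a feel for why EBCDIC data and copybook metadata need translation at all, here is a minimal Python sketch of decoding one fixed-length mainframe record by hand. It is not how DMX-h works internally and does not use any DMX-h API. The record layout, the field names, and the choice of code page `cp037` (US/Canada EBCDIC) are all assumptions made for illustration; a real layout would come from the COBOL copybook.

```python
from decimal import Decimal

# Hypothetical layout, standing in for a COBOL copybook such as:
#   05 CUST-ID    PIC X(6).
#   05 CUST-NAME  PIC X(10).
#   05 BALANCE    PIC S9(5)V99 COMP-3.
# Each entry: (field name, byte offset, byte length, kind)
LAYOUT = [
    ("cust_id", 0, 6, "text"),
    ("cust_name", 6, 10, "text"),
    ("balance", 16, 4, "packed"),
]


def unpack_comp3(raw: bytes, scale: int = 2) -> Decimal:
    """Decode a packed-decimal (COMP-3) field.

    Every byte holds two decimal digits, except the last byte,
    whose low nibble is the sign (0xD means negative).
    """
    digits = []
    for byte in raw[:-1]:
        digits.append(byte >> 4)
        digits.append(byte & 0x0F)
    digits.append(raw[-1] >> 4)
    sign_nibble = raw[-1] & 0x0F

    value = int("".join(str(d) for d in digits))
    if sign_nibble == 0x0D:
        value = -value
    # Shift the implied decimal point (the "V99" in the picture clause).
    return Decimal(value).scaleb(-scale)


def decode_record(record: bytes, codepage: str = "cp037") -> dict:
    """Translate one fixed-length EBCDIC record into Python values."""
    out = {}
    for name, offset, length, kind in LAYOUT:
        raw = record[offset:offset + length]
        if kind == "text":
            # EBCDIC text: decode with the code page, drop space padding.
            out[name] = raw.decode(codepage).rstrip()
        else:
            out[name] = unpack_comp3(raw)
    return out


# Build a sample 20-byte record the way it would look on the mainframe.
sample = (
    "C00042".encode("cp037")
    + "JANE DOE  ".encode("cp037")
    + bytes([0x01, 0x23, 0x45, 0x6C])  # +1234.56 as packed decimal
)

print(decode_record(sample))
```

Even this toy record mixes character data with a binary numeric encoding, so a plain character-set conversion of the whole file would corrupt the balance field. That is the kind of detail a copybook-aware tool handles on behalf of the analyst.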
Alternatively, the data can be kept in its original mainframe record format, fixed or variable, for archive purposes or simply to use the cluster for scalable and cost-effective computing. This data can then be written back to the mainframe without format changes, meeting audit and compliance requirements. In essence, Syncsort DMX-h makes mainframe data distributable for Hadoop and Spark processing. Syncsort DMX-h also secures the entire process with certified Apache Sentry integration, native Kerberos and LDAP support, and secure connectivity. This flexibility and these capabilities were driven by the use cases of our joint customers.
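As an illustration of what keeping data in its original variable-length record format involves, the following Python sketch splits a byte stream into records using the 4-byte record descriptor word (RDW) that precedes each variable-length record, then reassembles the stream byte for byte. It assumes the file was transferred in binary with RDWs intact, without block descriptor words or spanned records; the function names are hypothetical and unrelated to DMX-h.

```python
import struct
from typing import List


def split_rdw_records(blob: bytes) -> List[bytes]:
    """Split a byte stream into variable-length record payloads.

    Each record starts with an RDW: a 2-byte big-endian length that
    counts the RDW itself, followed by 2 reserved bytes.
    """
    records = []
    pos = 0
    while pos < len(blob):
        if pos + 4 > len(blob):
            raise ValueError(f"truncated RDW at offset {pos}")
        (length,) = struct.unpack_from(">H", blob, pos)
        if length < 4 or pos + length > len(blob):
            raise ValueError(f"corrupt RDW at offset {pos}")
        # Keep the payload as raw bytes: no code page conversion.
        records.append(blob[pos + 4:pos + length])
        pos += length
    return records


def join_rdw_records(records: List[bytes]) -> bytes:
    """Rebuild the original stream by prefixing each payload with its RDW."""
    return b"".join(
        struct.pack(">HH", len(payload) + 4, 0) + payload
        for payload in records
    )


# Two EBCDIC payloads of different lengths.
payloads = ["ALPHA".encode("cp037"), "BETA-RECORD".encode("cp037")]
blob = join_rdw_records(payloads)

# Splitting and rejoining returns exactly the bytes we started with.
round_trip = join_rdw_records(split_rdw_records(blob))
print(round_trip == blob)
```

Because the payloads are never decoded, the split can be used to distribute records across a cluster while the archived copy stays identical to the mainframe original.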
We look forward to continuing to work with Cloudera to offer our customers best-of-breed data management solutions. Watch our video to see how you can easily access and integrate mainframe data into Cloudera EDH.