The Next Frontier of Data Integration: Data Lineage and Governance
For the past few years, Syncsort DMX-h has been helping large enterprises populate their data lakes by making it easy to access legacy data coming from the Mainframe or Enterprise Data Warehouse platforms such as Teradata or Oracle and integrate it with data in Hive, HDFS, Kafka, etc.
As a growing number of enterprises successfully deploy their newly populated data lakes, we have started to hear about the next pain point: Data Lineage and Governance. In Syncsort’s recently published Big Data survey, nearly 60% of respondents who are testing or in production with Hadoop or Spark identified “including the data lake in data governance initiatives and meeting regulatory compliance mandates” as a significant challenge.
Tracking Data Movement
Data lineage tracks, at a field level, data origination (source), what happens to it (transformations), and where it moves to over time (target). Data lineage also simplifies tracing errors back to their sources in a data analytics process.
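To make the idea of field-level lineage concrete, here is a minimal sketch in Python. It is not how DMX-h or Navigator represent lineage internally; the `Field`, `LineageEdge`, and `LineageGraph` names, and the sample datasets and transformations, are all hypothetical and chosen only for illustration. Each edge records a source field, a target field, and the transformation applied, and tracing an error back to its sources becomes a walk upstream through those edges.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """A single field in a dataset (hypothetical model, for illustration only)."""
    system: str   # e.g. "mainframe", "hdfs", "hive"
    dataset: str
    name: str


@dataclass(frozen=True)
class LineageEdge:
    """One recorded movement: source field -> target field via a transformation."""
    source: Field
    target: Field
    transformation: str


class LineageGraph:
    def __init__(self):
        # Maps each target field to the edges that feed it.
        self._incoming = defaultdict(list)

    def add(self, source, target, transformation):
        self._incoming[target].append(LineageEdge(source, target, transformation))

    def trace_to_origins(self, field):
        """Walk upstream and return the fields that have no recorded source."""
        origins, seen, stack = set(), set(), [field]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            edges = self._incoming.get(current, [])
            if not edges:
                origins.add(current)
            for edge in edges:
                stack.append(edge.source)
        return origins


# Hypothetical example: a Hive column derived from a mainframe field and a rate table.
acct = Field("mainframe", "CUSTOMER.VSAM", "ACCT_BAL")
rate = Field("teradata", "fx_rates", "usd_rate")
staged = Field("hdfs", "/staging/customers", "acct_bal")
final = Field("hive", "analytics.customers", "balance_usd")

graph = LineageGraph()
graph.add(acct, staged, "convert EBCDIC packed decimal to decimal")
graph.add(staged, final, "multiply by usd_rate")
graph.add(rate, final, "multiply by usd_rate")

origins = graph.trace_to_origins(final)
for origin in sorted(origins, key=lambda f: f.system):
    print(origin.system, origin.dataset, origin.name)
```

If a value in `balance_usd` looks wrong, the trace points at the two places it could have originated: the mainframe account balance or the exchange-rate table. A real governance tool stores far more metadata per edge, but the source, transformation, and target triple is the core of it.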
Enterprises must track data movement throughout the organization for many types of use cases including regulatory reporting, security, and auditing. This might be part of a larger data governance practice in the organization. The challenges for the organizations are: addressing the volume and variety of their data sources (e.g., mainframe, DBMS, files, external); tracking and understanding the vast movement and transformation of data; and being able to “consume” this understanding presented via a graphical user interface, or integrated with tools or technologies already present within the organization.
Cloudera Navigator is Cloudera’s Data Governance solution for Hadoop. It automatically collects audit logs from across the entire platform and maintains a full history, with a unified, searchable audit dashboard for simple, point-in-time visibility.
Syncsort has partnered with Cloudera to extend Navigator’s reach beyond the Hadoop cluster. Not only does Syncsort DMX-h access data from the Mainframe, RDBMS, or other legacy sources and transform it into Hadoop-compatible formats, but now, with new, extended integration with Cloudera Navigator, it also makes the lineage information accessible to Navigator.
DMX-h is also used for data integration within the cluster: ETL jobs created in the DMX-h point-and-click interface can be run on MapReduce, Spark, or stand-alone Windows/Linux/Unix systems. And the best part is that now, all the details of that data processing can also be published to Navigator. This means that regardless of whether the data movement and transformation process was run inside Hadoop, outside of it, or some of both, Navigator shows the data lineage from beginning to end.
Syncsort DMX-h makes its lineage information available through an API that can be integrated into different Data Lineage and Governance solutions. As the first of our joint customers with Cloudera opted to use Navigator as their Data Governance solution, our engineering team worked very closely with our partners at Cloudera to implement the deep integration of DMX-h lineage with Navigator. The joint customers were involved in every step of the development process to provide feedback and ensure the integration would meet their needs.
For enterprises that are not using Cloudera Navigator, DMX-h makes the lineage information available through a REST API that can be used to integrate with other governance solutions.
For more, make sure to check out our webcast from Dr. Tendü Yoğurtçu on Data Quality and Lineage.