From Source to Analytics: A Farm-to-Table Approach to Data Governance
Editor’s note: This article on data governance, written by Syncsort’s Tendü Yoğurtçu, was originally published in Information Management.
Farm-to-table conjures up images of fresh produce delivered directly from the garden to the kitchen. A foundational part of this concept is knowing where the food was grown, where it has traveled and whether it is safe to eat.
As it turns out, a similar approach can be applied to data governance. Farm-to-table is one way IT experts can frame how they’re governing data and evaluating the tools and technologies necessary for meeting compliance requirements and use cases like anti-money laundering or credit card fraud detection.
As data and IT professionals think about the “data lifecycle” – from origin of data to delivery in next generation analytics platforms – they must understand the importance of a strategic approach to data governance through security and data quality practices. This ensures data arrives fresh and clean for advanced analytics and can be easily tracked.
Perhaps the most important consideration for farm-to-table food is deciding if it’s safe to eat. So too is confirming the security of data for your organization.
The requirements for data availability and security are becoming broader and more complex thanks to the data explosion of the digital era. To plan for unexpected outages, organizations are forced to review their data availability and disaster recovery policies across global data centers, whether on-premises or in the cloud.
Companies must implement strategies and technologies to guarantee data is secure, protected and available 24/7. For highly regulated industries such as banking or insurance, security matters even more because data must satisfy specific, stringent regulatory compliance requirements. Yet for all industries, meeting privacy, confidentiality and compliance requirements on growing sets of data is both critical and challenging.
The General Data Protection Regulation (GDPR) and similar regulations require companies to implement best practices around data encryption, masking and anonymization to ensure their data lakes, data-as-a-service and data marketplace environments are as managed and secure as possible.
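As a rough illustration of the masking and anonymization practices mentioned above, the following Python sketch masks a card number and replaces a customer identifier with a keyed hash. The field names, the sample record and the hard-coded key are assumptions made for this example, not part of any regulation or product. Keyed hashing is generally treated as pseudonymization rather than full anonymization, since anyone holding the key can re-link the values.

```python
import hashlib
import hmac

# Hypothetical key for illustration only; a real deployment would fetch
# this from a key management system rather than hard-code it.
SECRET_KEY = b"example-key-not-for-production"


def mask_card_number(card_number: str) -> str:
    """Hide every digit except the last four."""
    digits = [c for c in card_number if c.isdigit()]
    return "*" * (len(digits) - 4) + "".join(digits[-4:])


def pseudonymize(value: str) -> str:
    """Replace an identifier with a keyed SHA-256 hash.

    The same input always yields the same token, so records can still
    be joined across data sets without exposing the raw identifier.
    """
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical record used only to demonstrate the two functions.
record = {"customer_id": "C-1001", "card_number": "4111 1111 1111 1111"}
protected = {
    "customer_id": pseudonymize(record["customer_id"]),
    "card_number": mask_card_number(record["card_number"]),
}
print(protected)
```

Masking is applied before the data reaches the data lake in this sketch, so downstream analytics never see the raw values.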
Securing data in the most optimized, scalable and efficient way possible while meeting SLA requirements and providing real-time insights is a key challenge. Organizations must prioritize selecting data products that are flexible and easy to integrate with security frameworks.
Just as consumers want to know whether their food is fresh, organizations need to know their data is trusted and fit for purpose. As data flows in and out of the data lake, businesses must understand the sources of and relationships between data and set up business rules to continuously measure quality.
An effective data quality policy will take into consideration missing, outdated or redundant data and inconsistent formats to create consolidated, clean and verified data for analytics and reporting. Ultimately, bad data leads to incorrect insights.
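To make the four problem types concrete, here is a minimal Python sketch of a business rule set that counts missing, outdated, redundant and inconsistently formatted records. The field names (`email`, `updated`), the one-year freshness threshold and the sample records are all assumptions chosen for illustration; a real policy would define its own rules.

```python
from datetime import date


def quality_report(records, as_of, max_age_days=365):
    """Count records that fail each of four simple quality rules."""
    report = {"missing": 0, "outdated": 0, "redundant": 0, "bad_format": 0}
    seen = set()
    for rec in records:
        # Missing: a required field is absent or empty.
        if not rec.get("email") or not rec.get("updated"):
            report["missing"] += 1
            continue
        # Inconsistent format: the date is not ISO formatted (YYYY-MM-DD).
        try:
            updated = date.fromisoformat(rec["updated"])
        except ValueError:
            report["bad_format"] += 1
            continue
        # Outdated: last update is older than the allowed age.
        if (as_of - updated).days > max_age_days:
            report["outdated"] += 1
        # Redundant: same email seen before, ignoring case and whitespace.
        key = rec["email"].strip().lower()
        if key in seen:
            report["redundant"] += 1
        seen.add(key)
    return report


# Hypothetical sample data, one record per problem type plus a clean one.
records = [
    {"email": "ana@example.com", "updated": "2024-05-01"},
    {"email": "ANA@example.com ", "updated": "2024-05-02"},
    {"email": "", "updated": "2024-05-01"},
    {"email": "bo@example.com", "updated": "01/02/2024"},
    {"email": "cy@example.com", "updated": "2020-01-01"},
]
report = quality_report(records, as_of=date(2024, 6, 1))
print(report)
```

Running such a report on every load, rather than once, is what turns the rules into the continuous measurement described above.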
Machine learning is increasingly adopted for mainstream use cases like fraud detection and for more transformative ones like analyzing social networks for customer insights, and all of these must rely on trusted data sets. The impact of bad data is multiplied for next-generation analytics because the data is used to train algorithms that then make recommendations based on new data sets.
For example, entity resolution – distinguishing matches across massive datasets that indicate a single specific entity – requires sophisticated multi-field matching algorithms and significant compute power. To meet these evolving use cases, data quality, cleansing and preparation routines must handle rich data formats with growing data volumes and be highly scalable.
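The idea of multi-field matching can be sketched in a few lines of Python. This example uses the standard library's `difflib.SequenceMatcher` as a stand-in for the more sophisticated algorithms a production tool would use. The field names, weights, threshold and sample records are hypothetical, and comparing every pair of records this way would not scale to massive datasets without blocking or distributed compute.

```python
from difflib import SequenceMatcher

# Hypothetical field weights; real systems typically tune these on labeled data.
WEIGHTS = {"name": 0.5, "address": 0.3, "phone": 0.2}


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so formatting noise does not count."""
    return " ".join(value.lower().split())


def field_similarity(a: str, b: str) -> float:
    """Similarity between two field values, from 0.0 to 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted average of per-field similarities."""
    return sum(w * field_similarity(rec_a[f], rec_b[f]) for f, w in WEIGHTS.items())


def same_entity(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> bool:
    """Treat two records as one entity when the score reaches the threshold."""
    return match_score(rec_a, rec_b) >= threshold


# Hypothetical records: a and b differ only by small typos, c is unrelated.
a = {"name": "Jon Smith", "address": "12 Main St", "phone": "555-0100"}
b = {"name": "John Smith", "address": "12 Main St.", "phone": "555-0100"}
c = {"name": "Maria Lopez", "address": "98 Oak Ave", "phone": "555-0199"}

print(same_entity(a, b), same_entity(a, c))
```

Weighting several fields together is what lets the match tolerate a typo in one field while still rejecting records that merely share a similar phone prefix.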
Advanced business and operational analytics also require data to be kept in sync with the data source. Tracking and detection must happen rapidly, as current transactions need to be continuously added to combined datasets and prepared and presented to models as close to real-time as possible.
With the growing adoption of data-as-a-service and platform-as-a-service architectures including cross-platform, hybrid cloud environments and multi-cloud implementations, creating visibility into data becomes crucial. Knowing where the data originated, what transformations it has gone through and how many copies of the data set are replicated across multiple data stores is necessary both to serve compliance requirements and to optimize resources. For a use case like anti-money laundering, financial institutions require complete, detailed data lineage from origin to end point, making it a core data governance element.
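The three questions above (where the data originated, what transformations it went through, and where copies live) map naturally onto a small record structure. The Python sketch below is a hypothetical illustration only; the class, its fields and the sample values are assumptions and do not reflect how any particular lineage tool stores its metadata.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LineageRecord:
    """Hypothetical lineage entry for one data set."""
    dataset: str
    origin: str
    transformations: List[str] = field(default_factory=list)
    copies: List[str] = field(default_factory=list)

    def add_transformation(self, step: str, platform: str) -> None:
        # Record what was done and where it ran (on-premises, cloud, ...).
        self.transformations.append(f"{step} [{platform}]")

    def add_copy(self, store: str) -> None:
        # Track each data store holding a replica, without duplicates.
        if store not in self.copies:
            self.copies.append(store)

    def trail(self) -> str:
        # Origin-to-end-point path, in the order the steps were applied.
        return " -> ".join([self.origin] + self.transformations)


# Hypothetical anti-money laundering feed, used only as sample data.
lineage = LineageRecord(dataset="transactions", origin="core_banking_db")
lineage.add_transformation("mask account numbers", "on-premises ETL")
lineage.add_transformation("join with customer profiles", "cloud data lake")
lineage.add_copy("cloud data lake")
lineage.add_copy("reporting warehouse")
lineage.add_copy("cloud data lake")  # ignored: already recorded

print(lineage.trail())
print(len(lineage.copies), "copies:", lineage.copies)
```

Counting copies alongside the transformation trail serves both goals named above: the trail supports compliance audits, and the copy list shows where storage can be consolidated.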
In selecting data lineage tools, IT professionals should make sure solutions offer a comprehensive and granular view of where the data has been and what has been done to it along the way, regardless of whether data movement and transformation ran inside or outside of Hadoop or the data lake, on-premises or in the cloud.
As IT professionals create a strategic approach to data governance to deliver meaningful insights and meet compliance requirements, they’re placing greater emphasis on security, data quality and data lineage to guarantee trusted and current data is used for analytics. As machine learning and AI become widespread in analytics, these strategic considerations are integral for companies looking to achieve greater insights and more value from their data.
For more, make sure to check out our webcast from Dr. Tendü Yoğurtçu on Data Quality and Lineage.