Data Deluge: Critical Challenges in Data Governance at Enterprise Data World
Michael Stonebraker’s keynote speech at the 2019 Enterprise Data World conference introduced the “800 Pound Gorilla in the Room”: the staggering variety of data that is flooding every organization. We often think of variety simply in terms of data formats: structured vs. unstructured, relational databases vs. JSON, and so on. Joe Caserta noted that there are at least 24 different categories of alternative “new” data, including everything from B2B transactions to pricing, from social media to reviews and ratings, and from geolocation to weather, satellite, and open government data. Organizations are trying to leverage this great variety to enhance operations and logistics, develop new analytics, and gain business insights for competitive advantage. Price scraping to gauge trends in competitive shoe sales and private jet tracking to identify merger and acquisition activity are but two examples.
Variety also comes in other forms, though, and not all of them advantageous. The 75 different procurement systems in use at GE, highlighted by Stonebraker, are an example, particularly where the optimal number of systems is one. IDC noted several years ago in their report “The Copy Data Problem” that replicated data is a roughly $44 billion issue. Stonebraker cited the example of Merck, with some 4,000 Oracle databases, plus a data lake, plus files. This is a huge data integration problem to manage: organizations must ingest, transform, clean, and integrate these diverse schemas, and then deduplicate and resolve entities to get to any level of consistency. Beyond this, many organizations are finding that different departments repeatedly purchase the same external data, generating further data sprawl and adding unnecessary expense.
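The deduplication and entity resolution step mentioned above can be illustrated with a minimal sketch. This is not any vendor's algorithm, just a toy example: supplier names arriving from hypothetical procurement systems are normalized (lowercased, punctuation and corporate suffixes stripped) so that near-duplicate records collapse to one entity.

```python
def normalize(name: str) -> str:
    # Crude normalization: lowercase, drop punctuation and common corporate suffixes
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch == " ")
    tokens = [t for t in cleaned.split() if t not in {"inc", "corp", "ltd", "llc"}]
    return " ".join(tokens)

def deduplicate(records):
    # Keep the first record seen for each normalized entity key
    seen = {}
    for rec in records:
        seen.setdefault(normalize(rec["supplier"]), rec)
    return list(seen.values())

# Hypothetical supplier records from three separate procurement systems
suppliers = [
    {"supplier": "Acme Corp.", "system": "procurement_a"},
    {"supplier": "ACME, Inc", "system": "procurement_b"},
    {"supplier": "Globex Ltd", "system": "procurement_c"},
]
merged = deduplicate(suppliers)  # the two Acme variants resolve to one entity
```

Real entity resolution uses fuzzy matching and machine learning rather than exact keys, but the shape of the problem — many representations, one real-world entity — is the same.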
There are multiple implications and challenges from a data governance perspective, and these were significant topics and clear trends at the conference: finding the data, determining how the data is related, ensuring it’s of the right quality for use, and continuing to govern the content.
A consistent theme over the last couple of years is the amount of time data scientists, business users, and analysts spend trying to find relevant, useful data and, once found, sorting out which source is the system of record or truth. In one session it was noted that data scientists at iRobot spend “90% of their time finding and cleaning data, and 90% of the remaining 10% checking the cleansing!” Data discovery and data cataloging are becoming key tools to address this challenge: the former scans the information landscape and captures insights into the data and its content; the latter stores discovered content in a manner that provides clarity, meaning, and trust for organizational users. The goal is to support a crowd-sourced approach to data curation that allows ongoing insight into consistent and effective data use, where staff are empowered and contribute to business and data knowledge.
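The catalog-plus-crowdsourcing idea can be sketched in a few lines. This is a deliberately minimal in-memory model, with invented source names and tags, not any particular catalog product: sources are registered with an owner and description, any user can contribute tags, and consumers search by tag instead of hunting across systems.

```python
from collections import defaultdict

class Catalog:
    """Minimal in-memory data catalog: register sources, tag them, search by tag."""

    def __init__(self):
        self.sources = {}
        self.tags = defaultdict(set)

    def register(self, name, description, owner):
        self.sources[name] = {"description": description, "owner": owner}

    def tag(self, name, label):
        # Crowd-sourced annotation: any user can attach a label to a source
        self.tags[label].add(name)

    def find(self, label):
        # Consumers search by tag rather than scanning every system
        return sorted(self.tags.get(label, set()))

catalog = Catalog()
catalog.register("crm.customers", "Customer master records", owner="sales-ops")
catalog.register("lake.weather_feed", "Third-party weather data", owner="analytics")
catalog.tag("crm.customers", "customer")
catalog.tag("crm.customers", "system-of-record")
```

A production catalog adds automated discovery scans, lineage, and quality scores on top of this basic register-annotate-search loop.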
One of the key questions in assessing a given data source is where it came from (and where it has gone). Increasingly this is a core requirement of regulations such as BCBS 239, HIPAA, and even GDPR as we look to understand where customer data resides. Data lineage and associated capabilities for assessing data relationships are now critical to address the challenge of scale as data is distributed and redistributed around the organization. While many point solutions exist, capturing an organizational view of the relationships across systems, applications, and integrations, particularly with newer platforms such as data lakes and cloud, remains a growing challenge.
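At its core, lineage is a directed graph of dataset-to-dataset flows, and the "where has customer data gone?" question is a graph traversal. A minimal sketch, with hypothetical dataset names:

```python
from collections import defaultdict

class Lineage:
    """Directed graph of dataset-to-dataset flows."""

    def __init__(self):
        self.downstream = defaultdict(set)

    def record_flow(self, source, target):
        # Each recorded flow is one edge: data moved from source to target
        self.downstream[source].add(target)

    def trace(self, dataset):
        # Walk the graph to find everything fed, directly or indirectly, by `dataset`
        seen, frontier = set(), [dataset]
        while frontier:
            node = frontier.pop()
            for child in self.downstream[node]:
                if child not in seen:
                    seen.add(child)
                    frontier.append(child)
        return seen

lineage = Lineage()
lineage.record_flow("crm.customers", "warehouse.dim_customer")
lineage.record_flow("warehouse.dim_customer", "reports.churn_dashboard")
reach = lineage.trace("crm.customers")  # every place customer data has spread
```

The hard part in practice is not the traversal but populating the graph: edges must be harvested from ETL jobs, application code, and integration tools across the whole landscape.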
Ensuring data is of the right quality for use has been an ongoing challenge for many years. Reports over the years continually estimate that between 10% and 30% of productivity is lost due to poor quality data. It’s not just the old axiom of “garbage in, garbage out”, though. Now we’re starting to see added challenges with initiatives around data science and analytics, including: (perfect data → garbage model → garbage results) and (garbage data → perfect model → garbage results). We also need to think about obsolete or unused data. These sources suffer from data “rot”: the data degrades or depreciates in value and use. Even if we’ve discovered, cataloged, and traced the lineage of the data, we need to evaluate the ongoing use and relevance of sources, not just because we want the most current data, but also because neglected data can become a prime target for fraudulent use.
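One simple, automatable signal of data rot is refresh age: sources that have not been updated within an agreed window get flagged for review. A minimal sketch, assuming each dataset's last-refresh timestamp is tracked (the names and the 90-day threshold are illustrative):

```python
from datetime import datetime, timedelta

def flag_stale(datasets, now, max_age_days=90):
    """Return names of datasets whose last refresh is older than the threshold."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(name for name, last_refresh in datasets.items()
                  if last_refresh < cutoff)

# Hypothetical last-refresh timestamps for two sources
now = datetime(2019, 6, 1)
datasets = {
    "pricing.competitor_scrape": datetime(2019, 5, 20),
    "legacy.partner_feed": datetime(2018, 11, 2),
}
stale = flag_stale(datasets, now)  # the long-untouched legacy feed is flagged
```

Age is only a proxy; a fuller rot check would also consider query activity, downstream consumers from the lineage graph, and whether the source's owner still exists.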
Data cleansing is tough. And targeting the right data to correct and maintain is challenging, particularly given the ongoing increase in sources and variety of content. The only way to address this is to establish effective data governance. Without insight into and review of data, we will continue to use the data sources that happen to be available or known, rather than the best ones. It’s important to establish ongoing monitoring of data: what has been captured and identified, when it was created, whether there is associated data lineage, and what its quality is. And, as part of monitoring, that information needs to be reported and presented in a manner that scales.
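The monitoring dimensions just listed — what was captured, whether lineage exists, what the quality is — can be rolled up into a simple periodic report. A minimal sketch over invented per-source metadata, with an assumed quality bar of 0.8:

```python
def governance_report(sources, quality_bar=0.8):
    """Summarize monitoring metadata: flag sources missing lineage or below the quality bar."""
    return {
        "total": len(sources),
        "missing_lineage": [s["name"] for s in sources if not s["has_lineage"]],
        "low_quality": [s["name"] for s in sources if s["quality_score"] < quality_bar],
    }

# Hypothetical monitoring metadata gathered for two sources
sources = [
    {"name": "crm.customers", "has_lineage": True, "quality_score": 0.95},
    {"name": "lake.raw_events", "has_lineage": False, "quality_score": 0.6},
]
report = governance_report(sources)
```

Scaling this out means generating such summaries automatically across thousands of sources, so reviewers see exceptions rather than raw inventories.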
This data deluge is not going away. This is a challenge for anyone dealing with the volume and variety of data coming into an organization. As we look to democratize data, to encourage people to use and make core business decisions based on data, we need to ensure that the processes, people, and tools are in place to reduce the manual effort required to identify, manage, validate, and govern the data.
Download our white paper to find out more about the keys to data governance success.