Data is Inherently Messy. Is That Really Such a Bad Thing?
Editor’s note: This article on data management written by Syncsort’s Harald Smith was originally published on Infoworld.
In an imperfect world, consider shifting your data quality mindset from “how do I clean all this up?” to “how do I make the most of this state of affairs?”
A data quality expert once told me that vendors providing data quality software solutions should always ensure 100 percent quality data, and if they didn’t, they should be liable for any ensuing issues. I disagreed with that harsh assessment then—and still do. The truth is, sometimes 100 percent data quality isn’t necessary and could even hinder an organization’s ultimate business goals.
As much as you would like your data to be perfect and pristine, to conform to your established dimensions of data quality, it isn’t. While there’s been renewed focus in recent years on the importance of data quality for achieving higher-value data and improving machine learning, data quality is not a new problem. Tools to address data quality have existed since at least the early 1990s, and MIT held its first International Conference on Information Quality back in 1996.
After 20 to 25 years, you might expect that we would have mastered data quality! So why is 100 percent complete, clean, consistent, and accurate data still so difficult to achieve?
The answer lies in changing your mindset: Data quality is contextual, not universal. It’s time for us to accept and expect that data is messy: incomplete, nonstandard, inconsistent, inaccurate, and out of date—but that’s not necessarily a bad thing. By understanding the contexts that make data messy, you can focus your efforts on addressing data quality issues where they are most critical, and to tolerate the rest where other factors are more important—in other words, put data quality in the right place at the right time.
Good data or bad? Context matters
Not all data is created equal. We all have names—identifiers by which we are recognized. In seminars I’ve given, I’ve asked the question: “Is ‘John Doe’ good data?” Almost unanimously, the answer is no because it is considered fictitious and often used as test data. Yet “John Doe” is common and valid in health care or police investigations as the name for an unknown male (someone who does or did actually exist), in legal cases, and as part of a Twitter handle for more than 100 people, last I checked; not to mention that there are real people with that name. The name John Doe is complete, consistent, and can be accurate. But you need to understand the context before you can say whether it is good, bad, or simply needs additional processing logic.
Numeric values and dates can be equally challenging. Just think about a rating scale from 1 to 5. Is 1 the best rating, or is 5? Or a value of 100—is that a perfect grade, a high Fahrenheit temperature, an age, or an invalid credit rating? You need context (supplied via documentation, help, policies, metadata, etc.) to understand the data correctly, and to implement the right data quality checks and rules. You must then determine whether there is a data quality issue at all, and if so, whether it’s one around which you need data quality measurements and processes.
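To make the contextual nature of such checks concrete, here is a minimal Python sketch in which the same raw value (100) passes or fails validation depending on the field it belongs to. The field names and plausible ranges are illustrative assumptions, not taken from the article.

```python
# Context-dependent validation: the rule lives with the field, not the value.
# Field names and ranges below are illustrative assumptions.

RULES = {
    "satisfaction_rating": lambda v: 1 <= v <= 5,   # 5-point rating scale
    "temperature_f": lambda v: -80 <= v <= 140,     # plausible Fahrenheit reading
    "age_years": lambda v: 0 <= v <= 120,
    "credit_score": lambda v: 300 <= v <= 850,      # FICO-style range
}

def check(field, value):
    """Return True/False if the value is plausible for this field's context,
    or None when no context (rule) is available to judge it."""
    rule = RULES.get(field)
    return rule(value) if rule else None

# The value 100 is fine as a temperature or an age, but invalid as a
# 5-point rating or a FICO-style credit score.
print(check("temperature_f", 100))        # True
print(check("satisfaction_rating", 100))  # False
print(check("credit_score", 100))         # False
```

The point of the sketch is that "is 100 valid?" is unanswerable until metadata or documentation tells you which rule applies.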
Consistent data? Keeping the systems running
How you incorporate data into your operations and systems is another factor impacting your consideration of data quality. Building custom applications for every organizational function is expensive. Over time, you’ve replaced many of these with software packages and even suites of systems such as enterprise resource planning (ERP) products. Each of these products, as well as your homegrown applications, have systemic requirements. Enforcing a single, consistent organization-wide standard, whether for dates (annual calendar vs. timestamp vs. Julian date), Boolean values (T/F vs. Y/N vs. 1/0), or other codes, would be quixotic at best and otherwise resource- and revenue-consuming. The same is true for third-party data, including the increasing variety of open data available.
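One practical alternative to enforcing a single organization-wide standard is to translate each system's representation at the integration boundary. A minimal sketch, assuming a hypothetical set of Boolean spellings (T/F, Y/N, 1/0, and common variants):

```python
# Normalize system-specific Boolean flags at the integration boundary,
# rather than forcing every source system onto one standard.
# The accepted token sets are illustrative assumptions.

TRUE_VALUES = {"t", "true", "y", "yes", "1"}
FALSE_VALUES = {"f", "false", "n", "no", "0"}

def to_bool(raw):
    """Map a system-specific flag (T/F, Y/N, 1/0, ...) to a Python bool."""
    token = str(raw).strip().lower()
    if token in TRUE_VALUES:
        return True
    if token in FALSE_VALUES:
        return False
    raise ValueError(f"Unrecognized Boolean token: {raw!r}")

print(to_bool("Y"), to_bool("0"), to_bool(1))  # True False True
```

Each source system keeps its own convention; only the integration layer pays the translation cost.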
The definitions and semantics of data impact consistency of data as well. The definition of “customer,” for example, may vary depending on whether you are in marketing, order fulfillment, or finance. These semantic variations are more challenging because the data may look the same but produce different and inconsistent results (particularly in aggregated content) depending on inclusion or exclusion. Business glossaries, data catalogs, and system or operational documentation are imperative to ensure communication and effective data use.
It’s when you attempt to merge or integrate these sources or systems that data quality issues around consistency emerge. When deciding how to make sense of data that appears contrary, remember that systems are created for different needs at different places at different times, and with differing semantics. As you work to bring this disparate data together for new purposes, you need to establish which source becomes your system of record for each piece of data, where you need reference data, and ensure that the integration processes put in place reconcile and resolve the data differences and inconsistencies.
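The system-of-record idea can be sketched as a field-level precedence rule applied at merge time: when sources disagree, the designated owner of each attribute wins. The source names, fields, and precedence below are hypothetical:

```python
# Field-level "system of record": for each attribute, one source wins
# when records disagree. Sources and fields are hypothetical examples.

SYSTEM_OF_RECORD = {
    "email": "crm",        # CRM owns contact details
    "balance": "billing",  # finance system owns monetary fields
    "name": "crm",
}

def reconcile(records):
    """Merge per-source records into one, honoring the system of record,
    with a fallback to any source that has the field."""
    merged = {}
    for field, source in SYSTEM_OF_RECORD.items():
        if source in records and field in records[source]:
            merged[field] = records[source][field]
        else:
            for rec in records.values():
                if field in rec:
                    merged[field] = rec[field]
                    break
    return merged

records = {
    "crm": {"name": "Ada King", "email": "ada@example.com"},
    "billing": {"name": "A. King", "balance": 125.00},
}
print(reconcile(records))
# {'email': 'ada@example.com', 'balance': 125.0, 'name': 'Ada King'}
```

Note that the conflicting "name" values are resolved by policy ("crm" wins), not by guessing which string looks better.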
When service matters more
An accident scene or an emergency room is the most obvious example of a time when data quality is of secondary importance to providing the necessary service. You don’t refuse medical attention because you don’t know the injured person’s name, address, and date of birth. Sometimes the issues and needs are subtler, though. A good example comes from Feeding America, an organization that uses data and analytics to facilitate food distribution to those most in need. To extend its services, it implemented pilot programs to improve data collection and data quality. As it assessed those pilots, it found that where upfront data collection became too time-consuming, hungry people would actually leave the line without food rather than wait. The organization realized its quest to improve data quality conflicted with the organizational goal to best serve the underfed in America, and consequently scaled back the data collection effort to ensure optimal service.
New and emerging needs
Other business-driven requirements can create other, often divergent needs around data. To ensure accuracy in matching customer data, the more pieces of related data you have, the better. But in machine learning, including multiple pieces of highly correlated data can skew findings rather than surface something new. A good example, presented by Ingo Mierswa, used a data set from the Titanic disaster to show which factors best predicted survival. Because the data set included whether each individual was in a lifeboat, that field was selected as the predictive factor; but lifeboat status is largely an effect of surviving rather than a cause, so this highly correlated field skews the model instead of informing it. You will find similar high correlations simply by including both geolocation and street address for a customer. The data may be complete, consistent, and accurate, but still produce data quality issues.
Each step or process in an information supply chain includes factors or forces that must be addressed. The further you move along in that supply chain, the more constrained you are by prior decisions. There is a lot of freedom in what you initially collect and what requirements you put in place there. But as data is integrated and consolidated and then delivered to subsequent systems for reporting and analysis, you must account for the varied upfront requirements, and this is what makes information management challenging. For instance, does it make sense for an organization to forgo a sale by requiring an online order to include every demographic detail about a customer? No, because only a few pieces of data are required to fulfill it. But for that same order, will you test and validate the quality of the delivery address information? Yes, because that data is critical to high-quality customer service and successful completion of the order.
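The order example can be sketched as a validation routine that blocks fulfillment only on the few fields needed to complete and deliver the order, while leaving demographics optional. The field names and the US-ZIP format check are illustrative assumptions:

```python
# Validate only what fulfillment actually needs; never block the sale
# on optional demographics. Field names are illustrative assumptions.

REQUIRED = {"item_id", "quantity", "street", "city", "postal_code"}
OPTIONAL = {"age", "income_band", "marketing_opt_in"}  # nice to have, never blocking

def order_problems(order):
    """Return the list of issues that should block fulfillment (and only those)."""
    problems = [f"missing {f}" for f in sorted(REQUIRED - order.keys())]
    postal = order.get("postal_code", "")
    # Assumed US-only shop: delivery depends on a well-formed 5-digit ZIP.
    if postal and not (postal.isdigit() and len(postal) == 5):
        problems.append("postal_code not a 5-digit US ZIP")
    return problems

order = {"item_id": "SKU-1", "quantity": 2,
         "street": "1 Main St", "city": "Springfield", "postal_code": "0214"}
print(order_problems(order))  # ['postal_code not a 5-digit US ZIP']
```

The asymmetry mirrors the article's point: address quality is checked rigorously because delivery depends on it, while missing demographics cost nothing at this step.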
Learning to dance with data
You’ve spent a lot of time and money building data warehouses, data marts, master data solutions, and now data lakes to facilitate data-driven insight that you can trust. There is no one-size-fits-all data quality solution—but in a way, that’s a good thing. As I’ve described, targeting 100 percent data quality across the board is an unproductive aim because organizations, and even different lines of business in organizations, have different goals and address differing needs and constraints.
But there are actions that you can take to make the most of your data and ensure your data quality resources are being directed to the right places:
- Accept the reality that data is messy.
- Build up a reference of what data is available and what is useful for what purposes (including identifying sources of record).
- Make tools available for users to adopt right-time data quality practices and approaches, and ensure your employees are data literate by communicating data policies and practices.
- Understand where you are in the information supply chain and how you can draw on common patterns of information management to best leverage (and protect) data you want to use.
- Build up and use common objects that are clear and understandable for transforming, consolidating, and merging data to facilitate consistent preparation and outcomes.
- Establish measurements for the data that is critical to your needs, and evaluate it against business requirements.
- Assess the fitness of data for each purpose and recommend actions:
  - Is this the right data with the right context?
  - If so, what needs to be done to ensure it is of the right quality and fit for your purpose?
- Communicate findings so that others do not have to repeat the process.
When developing your organization’s data quality strategy, knowing where it matters and where it doesn’t is half the battle. Understanding that means you’ll be deploying data quality resources in the right places, reaping the benefits of data quality in meaningful ways while not wasting time and energy where it doesn’t.