Six Steps to Overcome Data Quality Pitfalls Impacting Your AI and Machine Learning Success – Part 1
This article on overcoming data quality pitfalls was originally published in Enterprise Executive. Part one of this two-part post focuses on the problems encountered with data quality, the comprehensiveness and completeness of data, and the accuracy of data.
We are used to traditional applications and systems which use a very specific, deterministic approach to decisions about data. Data enters a pipeline, specific functions and rules are applied to transform and enrich the data, and the results are reported for us to make decisions on. Different data may trigger different rules, and reference data may be updated, but the source code and processing steps are pre-determined.
This paradigm has changed with the growing use of AI and machine learning. In this new era, the data fundamentally shapes the algorithms and models – functioning as part of the “source code.” With machine learning, the algorithms and models processing data and generating analyses for decisions can actually be directly changed by the data used both in initial training and ongoing operation. And with black-box models, we don’t even have insight into how the algorithms have been changed, or why.
The 2019 Dimensional Research survey "Artificial Intelligence and Machine Learning Projects Obstructed by Data Issues" found that nearly 80% of AI and machine learning projects had stalled, with data quality and data labeling problems among the chief barriers. Without high-quality data for both training and evaluation, algorithms and models will remain suspect, or even lead to erroneous outcomes and conclusions. We see this most visibly in cases such as facial recognition, but it is a critical issue for effective adoption across domains.
This is not to say that we should step back from AI and machine learning. These capabilities offer us ways to unlock information hidden in complex and voluminous data. They are able to identify correlations and signals that humans just cannot detect on their own. Whether we consider use cases such as fraud detection, risk assessment in financial services and insurance, or even consumer buying behaviors, the opportunities are considerable.
Here we face a couple of key challenges. First, how do we ensure that we have quality data for use that does not obscure the signals we are looking for? And second, how do we gain transparency into the information generated so that we can “debug” our algorithms and models as needed?
AI and machine learning need clean, well-populated data to function properly and find insights. Yet companies have to figure out how to ensure their data is of the highest quality, and free of bias, without damaging the hidden data patterns in the process.
Organizations must always consider how they approach the problems they are looking to address and identify the data relevant to their AI and machine learning initiatives. They have to deliberately work through a number of steps to get the results they want:
- Identify the problem the business is trying to solve by using AI and machine learning.
- Determine the hypotheses to evaluate for that problem.
- Find the data needed to test the hypotheses and assess in relation to the problem.
- Examine whether the data is biased, accurate, or missing key information, and if there is enough data that shows the desired pattern.
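The last step above can be sketched as a simple pre-flight check before training. This is a minimal, hypothetical example (field names, thresholds, and the toy records are all illustrative, not from the original article) that reports how well each field is populated and how the target labels are distributed:

```python
# Hypothetical pre-flight data check: field population rates and
# label balance, run before any model training. All field names
# and sample records are illustrative.
records = [
    {"zip": "07302", "amount": 120.0, "label": "fraud"},
    {"zip": None,    "amount": 35.5,  "label": "legit"},
    {"zip": "10001", "amount": None,  "label": "legit"},
]

def population_rate(rows, field):
    """Fraction of rows in which the field is present and non-null."""
    return sum(1 for r in rows if r.get(field) is not None) / len(rows)

def label_counts(rows, field="label"):
    """Count how often each target label appears."""
    counts = {}
    for r in rows:
        counts[r[field]] = counts.get(r[field], 0) + 1
    return counts

zip_rate = population_rate(records, "zip")      # 2 of 3 rows populated
labels = label_counts(records)                  # distribution of outcomes
```

A check like this does not decide anything by itself; it surfaces the questions in the checklist (is key information missing? is there enough data showing the pattern of interest?) so a human with domain context can answer them.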
This is not necessarily an easy task. Context and understanding of the industry or domain from which the data originated is crucial to establishing useful hypotheses and ensuring data quality. Entity resolution, and de-duplication when combining multiple data sets, are particularly important challenges. For instance, in healthcare, a "John Doe" entry is not usually the name of a person (although it can be!), but an identifier for a real but unknown male who has received actual services for an actual condition, often in the emergency room. It's not as simple as purging these entries from the records (nor linking the numerous, but likely distinct, John Does together).
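One conservative way to handle this, sketched below under assumed field names, is to flag placeholder identities rather than purge them or merge them into one entity, so downstream de-duplication can treat them specially:

```python
# Illustrative sketch: tag placeholder identities instead of deleting
# or merging them. "John Doe" records represent real encounters with
# unknown patients, so they must stay in the data, kept distinct.
PLACEHOLDER_NAMES = {"john doe", "jane doe"}  # hypothetical list

def tag_placeholder(record):
    """Return a copy of the record with an is_placeholder flag,
    so dedup logic can skip these rows rather than link them."""
    out = dict(record)
    out["is_placeholder"] = out["name"].strip().lower() in PLACEHOLDER_NAMES
    return out

encounters = [
    {"name": "John Doe",    "dept": "ER"},
    {"name": "Maria Silva", "dept": "Cardiology"},
]
tagged = [tag_placeholder(r) for r in encounters]
```

The design choice here is to preserve the information (a real service was delivered) while preventing the entity-resolution step from wrongly collapsing many unknown patients into one.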
As another example, in the IoT realm, a given sensor may display a reading of -200 degrees Fahrenheit. But this may be an error code rather than a temperature. Having the context to understand the data is thus instrumental to ensuring accuracy, as well as determining best steps to address the data issue.
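A minimal sketch of that idea, assuming -200 is a device error code (the sentinel values and reading list are invented for illustration), converts known error codes to nulls so they never enter downstream statistics as temperatures:

```python
# Illustrative sentinel handling for IoT readings: -200 °F is assumed
# here to be a device error code, not a real temperature.
ERROR_CODES = {-200.0, -9999.0}  # hypothetical sentinel values

def clean_reading(value):
    """Return None for known error codes so downstream averages
    don't treat them as real temperatures."""
    return None if value in ERROR_CODES else value

readings = [72.4, -200.0, 68.9]
cleaned = [clean_reading(v) for v in readings]
valid = [v for v in cleaned if v is not None]
avg = sum(valid) / len(valid)  # computed only over real readings
```

Without the domain knowledge that -200 is an error code, an average over the raw readings would be badly skewed; with it, the bad value is excluded and can be routed to a separate error log instead.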
We usually think of data quality processes as a method to remove defects from data. But in a machine learning and AI context, how can you tell what’s a defect and what’s not? In advance (and often even in hindsight), it’s difficult to know which features of the data may be useful to the algorithms. Consequently, we need to think about the problem and a broader set of data quality dimensions before we apply standard data quality methods.
Comprehensiveness and Completeness
Traditionally, completeness as a dimension of data quality simply considers whether a given field or attribute is populated. Organizations have to be aware of data that may be missing from a data set they are going to use. For instance, postal codes could be missing from shipping orders, or time of purchase might be missing from in-store receipts. Whatever is missing, companies must pay attention to how they normalize or correct it: as soon as a decision is made on how to move forward with or without the missing data, that resolution can change the AI or machine learning algorithm's results, based on the decision-maker's understanding of the problem. Users of the algorithms must therefore be vigilant about how they are influencing the data and its quality. That holds whether the missing data is corrected or not; bias can enter the data set either way.
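One way to keep such corrections auditable, sketched here with an assumed fallback value and field names, is to record the imputation decision alongside the imputed value, so the model (and its users) can later see which rows were influenced by the fix:

```python
# Hedged sketch: impute a missing postal code but keep an explicit
# flag recording that the value was filled in, so the correction
# itself stays visible to later analysis. The fallback value and
# field names are illustrative.
DEFAULT_ZIP = "00000"  # hypothetical fallback

def impute_zip(order):
    """Return a copy of the order with zip filled in if missing,
    plus a zip_imputed flag marking the correction."""
    out = dict(order)
    out["zip_imputed"] = out.get("zip") is None
    if out["zip_imputed"]:
        out["zip"] = DEFAULT_ZIP
    return out

fixed = impute_zip({"zip": None, "amount": 35.5})
untouched = impute_zip({"zip": "07302", "amount": 120.0})
```

Carrying the flag forward makes the bias introduced by the correction measurable, rather than silently baked into the training data.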
Completeness of the given fields may not even be sufficient to address a given problem. If the challenge concerns increasing the number of potential prospects who buy a given product, or identifying potential agents committing fraud, a broader population segment may be needed than what is in the current data. This extends the consideration of "completeness" into "comprehensiveness": organizations must consider whether they need to obtain additional data (e.g., on website visitors) even when that data does not share the characteristics of what they have captured in their core operational systems.
Accurate, but Irrelevant
Additionally, companies must be aware that a data set can be entirely accurate yet completely irrelevant to their desired hypothesis. Consider crime statistics: data about past arrests for white-collar and blue-collar crime might largely lead to predictions that future crimes would occur in a city's richest and poorest locations, respectively. This is entirely accurate information, but it does little to tell the analyst where future crimes will actually occur. Or consider disaster response: if algorithms used social media content to determine which areas were hardest hit by a cataclysmic event, as was seen with Hurricane Sandy for parts of New Jersey, the results would largely be misleading, because no data would have been captured from areas without power, where people could not post to social media.
Thus, when determining data quality for AI, the accuracy of the data set (in the broad sense of the potential pool or population of data) is essential to determining whether the AI and machine learning can produce informative and useful results. Users must be sure that their data sets have all the signals they need, and also be willing to continue asking questions about the data and the outcomes.
Check back for part 2 where we focus on bias in data, reliability of past data, and the road to data quality.
Understanding which elements of data quality and data integrity matter most helps you get more out of your data. For more information on the state of data quality, take a look at our survey.