Six Steps to Overcome Data Quality Pitfalls Impacting Your AI and Machine Learning Success – Part 2
This article on overcoming data quality pitfalls was originally published in Enterprise Executive. Part one of this two-part post focused on the problems with data quality, the comprehensiveness and completeness of data, and the accuracy of data. Part two focuses on bias in data, the reliability of past data, and the road to data quality.
Being Aware of Bias
When dealing with data quality, companies must also stay on guard: there are many ways to introduce bias into AI and machine learning.
For example, there can be bias in the data set from the start. An example of this is using police data about past high crime areas to predict where crime will occur in the future. Training AI on this data could simply reinforce any racial or socioeconomic prejudices that caused those areas to be the focus for so many arrests.
Or there may be valid but inconsistent data collection processes. For instance, data entry staff at one facility may have left marital status blank when entering orders for people whose salutation is Mr. but included it for those who used Mrs. or Ms. If such data is used for training AI or machine learning, preferences for certain products might be falsely correlated with marital status.
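This kind of collection skew can often be surfaced by comparing missing-value rates across groups before training. A minimal sketch in Python, using the salutation example above (the records and field names are illustrative assumptions):

```python
# Illustrative order records; None marks a field the entry clerk left blank.
orders = [
    {"salutation": "Mr.", "marital_status": None},
    {"salutation": "Mr.", "marital_status": None},
    {"salutation": "Mrs.", "marital_status": "married"},
    {"salutation": "Ms.", "marital_status": "single"},
]

def missing_rate_by_group(records, group_field, value_field):
    """Fraction of missing values per group; large gaps hint at collection bias."""
    totals, missing = {}, {}
    for r in records:
        g = r[group_field]
        totals[g] = totals.get(g, 0) + 1
        if r[value_field] is None:
            missing[g] = missing.get(g, 0) + 1
    return {g: missing.get(g, 0) / totals[g] for g in totals}

print(missing_rate_by_group(orders, "salutation", "marital_status"))
# {'Mr.': 1.0, 'Mrs.': 0.0, 'Ms.': 0.0}
```

A gap this stark between groups suggests the blank values reflect the collection process, not the customers, and that the field should not be fed to a model as-is.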
Past Isn’t Always Precedent
Using historical data can also introduce bias into predictions about the future. For instance, if a company were examining the best people to hire based on past data sets about successful employees, there might be a significant bias towards men, as hiring practices in the past often discriminated against women.
What’s key to all of this is that organizations must do everything they can to gain insight into the data they wish to use and understand how it can reasonably be gathered, cleansed, and applied. One clear lesson of the big data era is that relying on bad data yields unreliable results at best, and at worst results that harm individuals or even the organization’s reputation. That’s why, even in a time when AI and machine learning can greatly streamline a company’s ability to find signals in big data, companies must focus even more on data quality than they have in the past. It is important to capture as many of these deeper insights as possible before feeding the data into AI or machine learning algorithms.
No one would expect a system or data pipeline to be implemented without testing and debugging. Similarly, good data quality doesn’t just happen. It must be consciously and carefully considered, curated, tested, and rigorously questioned and evaluated.
The Road to Quality Data
Once an organization has identified how the data they are using for machine learning could be compromised through biased, missing, or inaccurate information, the next step is to decide what can be done to the data set to make it as accurate as possible. There are six key steps companies can take to improve their data quality.
1) Data Profiling
Assessing data through data profiling is central to helping organizations determine whether using the data for machine learning will produce viable and useful results. Profiling means exploring and understanding the available data, both at a general level and with a deeper eye for outliers, segments of data, and other conditions. This is when organizations should look for missing values, clear indications of bias, or signs that the data may be irrelevant to the issue at hand. Profiling can also indicate how complete the data set is as a whole versus the needed data population.
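A first profiling pass can be sketched in a few lines of Python. The order records, field names, and two-standard-deviation outlier rule below are illustrative assumptions, not a prescribed method:

```python
from statistics import mean, stdev

# Illustrative order records; None marks a missing value.
orders = [
    {"customer": "A", "amount": 100.0, "postal_code": "10001"},
    {"customer": "B", "amount": 110.0, "postal_code": "10001"},
    {"customer": "C", "amount": 120.0, "postal_code": "10002"},
    {"customer": "D", "amount": 95.0,  "postal_code": None},
    {"customer": "E", "amount": 105.0, "postal_code": "10003"},
    {"customer": "F", "amount": 115.0, "postal_code": "10001"},
    {"customer": "G", "amount": 108.0, "postal_code": None},
    {"customer": "H", "amount": 4800.0, "postal_code": "10002"},
]

def profile(records, field):
    """Report missing counts, distinct values, and simple outliers for one field."""
    values = [r[field] for r in records]
    present = [v for v in values if v is not None]
    report = {
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }
    numeric = [v for v in present if isinstance(v, (int, float))]
    if len(numeric) >= 2:
        mu, sigma = mean(numeric), stdev(numeric)
        report["outliers"] = [v for v in numeric if abs(v - mu) > 2 * sigma]
    return report

print(profile(orders, "postal_code"))  # two missing postal codes
print(profile(orders, "amount"))       # 4800.0 is flagged as an outlier
```

Even a report this simple answers the questions raised above: how much is missing, how varied the values are, and which values deserve a closer look.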
2) Apply Standard Techniques
As a rule, improving data quality is more likely to boost real signals than to increase noise. Standardizing data so that it appears in the correct fields, for instance, provides a much better signal. Entity resolution can mean the difference between machine learning looking at Bob Smith, Dr. Robert Smith, and Rob Smith as three different people versus having a richer set of information to analyze about Dr. Robert Smith.
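A minimal entity-resolution sketch, using the Bob Smith example above. The title list and nickname map are illustrative assumptions; production systems use far richer matching rules:

```python
# Illustrative normalization tables for resolving name variants.
TITLES = {"dr", "mr", "mrs", "ms", "prof"}
NICKNAMES = {"bob": "robert", "rob": "robert", "bill": "william"}

def canonical_name(raw):
    """Strip titles and punctuation, then expand common nicknames."""
    tokens = [t.strip(".,").lower() for t in raw.split()]
    tokens = [t for t in tokens if t not in TITLES]
    tokens = [NICKNAMES.get(t, t) for t in tokens]
    return " ".join(tokens)

variants = ["Bob Smith", "Dr. Robert Smith", "Rob Smith"]
keys = {canonical_name(v) for v in variants}
print(keys)  # all three variants resolve to 'robert smith'
```

Once the three variants share one canonical key, the records behind them can be combined into the richer picture of Dr. Robert Smith that the algorithm should see.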
3) Dependency and Correlation Analysis
To determine the fitness of a data set prior to feeding it into AI or machine learning algorithms, companies can also utilize dependency and correlation analyses. They can then investigate the findings to determine whether the dependencies and correlations found are spurious rather than signs of causation, or genuine relationships (e.g., city and postal code) that may cause training models to overfit.
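One simple dependency check is to test whether one field functionally determines another, as postal code typically determines city; redundant columns like these can then be flagged before training. A sketch in Python (the sample rows are illustrative):

```python
from collections import defaultdict

# Illustrative (postal_code, city) pairs pulled from a data set.
rows = [
    ("10001", "New York"), ("10001", "New York"),
    ("94103", "San Francisco"), ("94103", "San Francisco"),
]

def determines(rows, src_idx=0, dst_idx=1):
    """True if each source value maps to exactly one destination value."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[src_idx]].add(row[dst_idx])
    return all(len(targets) == 1 for targets in seen.values())

print(determines(rows))  # True: postal code fully determines city here
```

When such a dependency holds, keeping both columns adds no information and can encourage a model to overfit to the redundancy.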
4) Matching and Deduplication
Companies can utilize their data matching software to not only discover common entities, but to evaluate sets or segments of data for distinct conditions such as invalid or non-unique identifiers. This type of analysis can indicate what type of matching techniques may be relevant for the problem, what additional data may be useful or needed, or where duplicated data occurs that should be removed prior to use.
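Both checks described above, flagging non-unique identifiers and removing exact duplicates, can be sketched briefly. The records and identifier scheme are illustrative assumptions:

```python
from collections import Counter

# Illustrative records with a supposedly unique identifier field.
records = [
    {"id": "C-1", "name": "Robert Smith"},
    {"id": "C-1", "name": "Robert Smith"},   # exact duplicate row
    {"id": "C-2", "name": "Jane Doe"},
    {"id": "C-2", "name": "Janet Doe"},      # same id, different person?
]

# Step 1: flag identifiers that are not actually unique.
id_counts = Counter(r["id"] for r in records)
non_unique = sorted(i for i, n in id_counts.items() if n > 1)
print("non-unique ids:", non_unique)

# Step 2: drop exact duplicates, keeping the first occurrence of each record.
seen, deduped = set(), []
for r in records:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(r)
print(len(deduped))  # 3 records remain; the C-2 conflict needs human review
```

The distinction matters: exact duplicates can be dropped mechanically, while a shared identifier over differing records signals a matching problem that needs investigation.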
5) Introspection and Evaluation
Prior to changing any data sets, companies should stop and consider how those changes will affect the overall output of the data. Will they add in bias? If so, in what way? On the other hand, will data standardization help eliminate existing bias? Organizations need to constantly ask questions about the data and whether they are really testing what they intended to test with the data they have available for machine learning.
6) Data Enrichment
Once a data set has been profiled, and the level of its comprehensiveness determined, organizations can then decide whether to enrich it with additional data (e.g., demographics, firmographics, or geospatial and other location data). For instance, a company may know that using postal code data on a particular set of orders does not provide all that much insight. However, if they can incorporate latitude and longitude for the actual delivery addresses into the original data set, this type of enrichment could enable location analysis that identifies other location-based factors. Enrichment gives companies the ability to see the data in new ways.
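A minimal enrichment sketch in Python, joining coordinates onto orders by delivery address. The lookup table stands in for a real geocoding service, and the addresses and coordinates are illustrative placeholders:

```python
# Illustrative orders lacking location detail beyond the address string.
orders = [
    {"order_id": 1, "address": "350 5th Ave, New York"},
    {"order_id": 2, "address": "1 Market St, San Francisco"},
]

# Illustrative geocode lookup; in practice this comes from a geocoding service.
geocodes = {
    "350 5th Ave, New York": (40.7484, -73.9857),
    "1 Market St, San Francisco": (37.7936, -122.3950),
}

for order in orders:
    lat_lon = geocodes.get(order["address"])
    order["lat"], order["lon"] = lat_lon if lat_lon else (None, None)

print(orders[0]["lat"])  # 40.7484
```

With latitude and longitude attached, each order can now feed location-based analyses, such as distance to a store or regional clustering, that the postal code alone could not support.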
Bringing It All Together
Ultimately, there’s no way around the need for companies to ensure they have the highest quality data possible to get the best results from their AI and machine learning algorithms. While the process of identifying biases present in the data, and of ensuring that it meets the necessary levels of comprehensiveness, completeness, and standardization, can be complex, it’s an essential step towards debugging the data that underlies machine learning predictions. And most importantly, improving data quality not only boosts the important signals that can guide business decisions but also establishes the trust and confidence in the complete process needed to act on those decisions.
Understanding which elements of data quality and data integrity matter most helps you get more out of your data. For more information on the state of data quality, take a look at our survey.