How Data Quality Impacts Machine Learning
GIGO – Garbage In, Garbage Out – is a saying that’s been around since the early days of computing. But in the age of artificial intelligence and machine learning, that old adage is more relevant than ever.
In his article, “Data quality in the era of Artificial Intelligence,” George Krasadakis, Senior Program Manager at Microsoft, puts it this way:
“Data-intensive projects have a single point of failure: data quality.”
He goes on to say that because data quality is such a critical issue, his team at Microsoft starts every project with a data quality assessment.
Machine learning is nothing if not data-intensive. For that reason, the quality of the data used in any machine learning project will inevitably have a huge effect on its chances for success. Let’s take a closer look at just how data quality impacts machine learning.
Why Machine Learning Algorithms are Vulnerable to Poor Quality Data
Machine learning (ML) is a branch of artificial intelligence in which computers learn to discern and act on subtle patterns in data without being explicitly programmed to do so. The ML algorithm learns by using large amounts of training data to adjust its internal parameters until it can reliably discriminate similar patterns in data it has not seen before.
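That parameter-adjustment process can be illustrated with a minimal sketch (a toy linear model fit by gradient descent; the data and hyperparameters here are made up for illustration, not drawn from any real project):

```python
# Minimal sketch: a model "learns" by repeatedly adjusting its internal
# parameters (here, slope w and intercept b) to better fit the training data.
def train_linear_model(data, epochs=2000, lr=0.01):
    """Fit y = w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        # Gradients of mean squared error with respect to w and b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Clean training data following the pattern y = 2x + 1
clean = [(x, 2 * x + 1) for x in range(10)]
w, b = train_linear_model(clean)  # w approaches 2, b approaches 1
```

With clean data the model recovers the underlying pattern; corrupt even a few of those labels and the fitted parameters drift accordingly, which is the vulnerability the rest of this article is about.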
By its very nature, a machine learning model is acutely sensitive to the quality of the data on which it is trained. Because of the huge volume of data required, even relatively small errors in the training data can lead to large-scale errors in the system’s output. As a recent article in the International Journal on Advances in Software puts it, “High-quality datasets are essential for developing machine learning models.”
Achieving the Data Quality Required for Machine Learning
As Microsoft’s Krasadakis indicates, assessing and improving data quality should be the first step of any machine learning project. This includes checking the data for consistency, accuracy, compatibility, completeness, and timeliness, as well as for duplicate or corrupted records.
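A few of those checks could be sketched as follows (the field names, validity rule, and sample records are hypothetical, invented purely to illustrate the idea):

```python
# Sketch of an automated data quality assessment: counts records with
# missing required fields, exact duplicates, and implausible values.
def assess_quality(records, required_fields):
    issues = {"missing": 0, "duplicates": 0, "invalid_age": 0}
    seen = set()
    for rec in records:
        # Completeness: every required field must be present and non-empty
        if any(rec.get(f) in (None, "") for f in required_fields):
            issues["missing"] += 1
        # Duplicates: flag records identical to one already seen
        key = tuple(sorted(rec.items()))
        if key in seen:
            issues["duplicates"] += 1
        seen.add(key)
        # Validity: a simple plausibility rule for an "age" field
        age = rec.get("age")
        if age is not None and not (0 <= age <= 120):
            issues["invalid_age"] += 1
    return issues

sample = [
    {"name": "Ada", "age": 36},
    {"name": "Ada", "age": 36},   # duplicate record
    {"name": "", "age": 29},      # missing name
    {"name": "Bob", "age": 230},  # implausible age
]
report = assess_quality(sample, required_fields=["name", "age"])
# report counts one issue of each kind for this sample
```

Real data quality tooling goes far beyond a sketch like this, of course, applying rules like these across millions of records from many sources.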
At the scale required for a typical ML project, adequately cleansing training or production data manually is a near impossibility. A single manual sweep through a large pool of data from disparate sources could take months. And, of course, that data isn’t static. It’s changing and increasing on a moment-by-moment basis.
That’s why taking advantage of an automated tool such as Syncsort’s Trillium Quality is crucial. Trillium Quality, which can be deployed for real-time or batch operation, on-premises or in the cloud, can efficiently cleanse data from a multitude of sources. Plus, Trillium allows you to design your data quality workflow visually with no coding required.
The Data Quality Challenge Will Continue to Grow
In his Harvard Business Review (HBR) article, “If Your Data Is Bad, Your Machine Learning Tools Are Useless,” Thomas C. Redman sums up our current data quality challenge this way:
“Increasingly-complex problems demand not just more data, but more diverse, comprehensive data. And with this comes more quality problems.”
In another HBR article, Redman notes that IBM has estimated that low-quality data costs businesses $3.1 trillion per year in the U.S. alone.
For any company that wants to participate in the machine learning revolution that’s already disrupting many parts of today’s business landscape, data quality is an issue that simply cannot be avoided.
Check out our eBook on 4 ways to measure data quality.