Understanding Data Quality: How Data Quality Errors Arise
Data quality is important to business. That you know. But do you understand what it takes to provide data quality? In yesterday’s blog, we looked at one real-life example of a data quality failure. Today we’ll review how data quality problems can arise.
Briefly defined, data quality refers to the ability of a data set to serve whichever need a company hopes to use it for. That need could be sending marketing materials to customers. It could be studying the market to plan a new product feature. It could be maintaining a database of customer data for help with product support services, or any number of other goals.
No matter what the exact use case for your data, data quality is important because without it, the data can’t fulfill its intended purpose. Errors within a database of addresses would prevent you from using the data to reach customers effectively. A database of phone numbers that doesn’t always include area codes for each entry falls short of providing the information you need to put the data to use in many situations.
The Causes of Data Quality Problems
Now that we’ve outlined what data quality means and provided a few examples of what it looks like in the real world, let’s delve a bit deeper into the types of problems that lead to data quality shortcomings.
Here are six common ways in which data quality errors can creep into your organization’s data operations, even if you generally adhere to best practices when it comes to managing and analyzing your data:
1. Manual Data Entry Errors
Humans are prone to making errors, and even a small data set that includes data entered manually by humans is likely to contain mistakes. Typos, data entered in the wrong field, missed entries and so on are virtually inevitable.
2. OCR Errors
Machines can make mistakes when entering data, too. In cases where organizations must digitize large amounts of data quickly, they often rely on Optical Character Recognition, or OCR, technology to do so. OCR technology scans images and extracts text from them automatically. It can be very useful when, for example, you want to take thousands of addresses that are printed on paper and enter them into a digital database so you can analyze them using Hadoop. The problem with OCR is that it is almost always imperfect.
If you’re OCR’ing thousands of lines of text, you’re almost certainly going to have some characters or words that are misinterpreted – zeroes that are interpreted as eights, for example, or proper nouns that are read as common words because the OCR tool fails to distinguish properly between capital and lowercase letters. The same sorts of issues arise with other types of automated machine entry of data, such as speech-to-text.
3. Lack of Complete Information
When compiling a data set, you frequently run into the problem of not having all information available for every entry. For example, a database of addresses may be missing the zip codes for some entries because the zip codes couldn’t be determined via the method that was used to compile the dataset.
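To make the idea concrete, here is a minimal Python sketch of a completeness check. The records, the field names and the REQUIRED_FIELDS list are all made up for illustration; a real data set would define its own required fields.

```python
# Hypothetical address records; the field names are assumptions for illustration.
records = [
    {"name": "Ana Ruiz", "street": "12 Oak Ave", "zip": "10001"},
    {"name": "Ben Cole", "street": "98 Elm St", "zip": ""},
    {"name": "Cy Dunn", "street": "5 Pine Rd", "zip": None},
]

# Assumed list of fields every entry should have.
REQUIRED_FIELDS = ("name", "street", "zip")


def missing_fields(record):
    """Return the required fields that are absent, None, or blank."""
    return [f for f in REQUIRED_FIELDS if not str(record.get(f) or "").strip()]


def completeness_report(rows):
    """Map the index of each incomplete row to its list of missing fields."""
    report = {}
    for i, row in enumerate(rows):
        gaps = missing_fields(row)
        if gaps:
            report[i] = gaps
    return report


print(completeness_report(records))
```

A report like this only tells you where the gaps are. Filling them in still requires another source of information.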
4. Ambiguous Data
When building a database, you may find that some of your data is ambiguous, leading to uncertainty about whether, how and where to enter it.
For example, if you are creating a database of phone numbers, some of the numbers you seek to enter may be longer than the typical ten digits that you have in a United States phone number. Are those longer numbers simply typos, or are they international phone numbers that include more digits? In the latter case, does the number contain complete international dialing information?
These are the sorts of questions that are hard to answer quickly and systematically when you’re working with a large body of data.
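One way to start answering them systematically is to sort numbers into rough buckets and send the unclear ones to a person for review. The sketch below is only an illustration: the bucket names and the 8-digit lower bound are assumptions, while the 15-digit upper bound comes from the E.164 numbering standard. It does not verify that a number is actually valid or in service.

```python
import re


def classify_phone(raw):
    """Sort a raw phone string into a coarse bucket for later review."""
    digits = re.sub(r"\D", "", raw)          # keep digits only
    has_plus = raw.strip().startswith("+")   # a leading "+" suggests a country code

    if len(digits) == 10 and not has_plus:
        return "us_10_digit"
    if len(digits) == 11 and digits.startswith("1"):
        return "us_with_country_code"
    # Assumed range: E.164 allows at most 15 digits; 8 is an arbitrary floor.
    if has_plus and 8 <= len(digits) <= 15:
        return "possibly_international"
    if len(digits) < 10:
        return "too_short"
    return "ambiguous"  # could be a typo or an incomplete international number


for number in ["(212) 555-0147", "+44 20 7946 0958", "212555014799"]:
    print(number, "->", classify_phone(number))
```

The point of the "ambiguous" bucket is that the code does not guess. A twelve-digit number with no country-code marker is flagged rather than silently corrected.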
5. Duplicate Data
You may find that two or more data entries are mostly or completely identical.
For example, maybe your database contains two entries for a John Smith living at 123 Main St. Based on this information, it’s difficult to know whether these entries are simply duplicates (maybe John Smith’s information was entered twice by mistake) or if there are two John Smiths (a father and son, perhaps) living at the same address. You need to sort out seemingly duplicate entries like this to make the best use of your data.
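A simple first step is to group entries whose name and address match after basic normalization. The Python sketch below assumes hypothetical "name" and "address" fields and a deliberately crude normalization rule. It only surfaces candidate duplicates; it cannot tell a double entry from a father and son at the same address.

```python
from collections import defaultdict


def normalize(value):
    """Lowercase, drop punctuation, and collapse whitespace for comparison."""
    cleaned = "".join(ch for ch in value.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())


def candidate_duplicates(rows):
    """Return lists of row indexes that share a normalized name and address."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        key = (normalize(row["name"]), normalize(row["address"]))
        groups[key].append(i)
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical entries for illustration.
people = [
    {"name": "John Smith", "address": "123 Main St."},
    {"name": "john  smith", "address": "123 main st"},
    {"name": "Jane Doe", "address": "9 Hill Rd"},
]

print(candidate_duplicates(people))
```

Deciding what to do with each group (merge, keep both, or ask someone) usually needs extra fields such as a birth date or customer ID.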
6. Data Transformation Errors
Converting data from one format to another can lead to mistakes.
As a simple example, you may have a spreadsheet that you convert to a comma-separated values (CSV) file. Because data fields inside CSV files are separated by commas, you may run into issues during this conversion if some of the entries in your spreadsheet contain commas themselves.
Unless your data conversion tools quote or escape such entries, there is no way to tell the difference between a comma that is supposed to separate two data fields and one that is an internal part of a data entry. This is a basic example; things get much more complicated when you must perform complex data conversions, such as taking a mainframe database that was designed decades ago and converting it to NoSQL, a category of database that has become popular in just the last few years.
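The comma problem is easy to reproduce. The short Python sketch below uses a made-up row to compare a naive join on commas with the standard library's csv module, which quotes any field that contains the delimiter.

```python
import csv
import io

# Hypothetical spreadsheet row; the middle field contains a comma.
row = ["John Smith", "123 Main St, Apt 4", "Springfield"]

# Naive conversion: joining on commas loses the field boundaries.
naive_line = ",".join(row)
naive_fields = naive_line.split(",")
print(len(row), "fields written,", len(naive_fields), "fields read back")

# The csv module wraps the comma-containing field in quotes.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
quoted_line = buffer.getvalue()
print(quoted_line.strip())

# Reading the quoted line back recovers the original three fields.
parsed = next(csv.reader(io.StringIO(quoted_line)))
print(parsed == row)
```

The naive version turns three fields into four, which shifts every later column in that row by one position.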
Correcting Data Quality Errors
These are the types of data quality mistakes that are very difficult to avoid. In fact, the best way to think about data quality problems is to recognize them as inevitable.
Having data quality problems doesn’t mean your data management process is flawed. The types of data issues described above are impossible for even the best-run data operation to avoid entirely.
Fortunately, there are solutions. Syncsort offers a range of data integration and data quality tools that can help you minimize the number of errors that are introduced during processes like data conversions, then find and automatically fix the data quality problems that do arise.
Check out our eBook: 4 ways to measure data quality.