Automated data entry tools such as OCR and speech-to-text save a lot of time when digitizing analog data – but they almost never deliver perfect results. In fact, they can be a data quality nightmare.
This is why data quality control is especially important if you rely on tools for automated data entry or conversion.
This article explains why automated data entry creates special data quality challenges and discusses strategies for addressing them.
Digital Data Sources and their Discontents
Some data is “born digital.” That means that it was first created in an electronic format.
For example, application logs and information that customers enter into Web forms are born digital. These types of data are digital and live on a computer from the start.
Yet most organizations still work with some data sources that are not born digital. For example, a company may require employees to submit paper receipts when requesting reimbursement for travel expenses. Or an organization may receive snail-mail letters that it archives.
Organizations also face the challenge of data that is born in one format but needs to be converted to another. For instance, your customer support team might record phone calls with customers. But even if the raw audio data that you collect in this way is stored in a digital format (like MP3 files), it can’t be analyzed using a text-based analytics platform like Hadoop. You need a way to convert audio files to text.
Automated Data Entry
Situations like the ones described above are why automated data entry tools are useful. Automated data entry tools take data in analog form and digitize it, or convert data from one form (like speech in an audio file) to another (like words in text form).
The most common form of automated data entry involves Optical Character Recognition, or OCR, tools. You can scan a paper document, then use an OCR tool to copy the text from the document into a digital text file.
Another common type of automated data entry involves taking recordings of human speech and converting the speech to text. This can be useful for transcribing phone calls or recordings of meetings.
There are even tools for converting smells to digital data – though, admittedly, your organization probably doesn’t have a reason to do that.
The Perils of Automated Data Entry
Automated data entry tools save loads of time. An OCR program can convert hundreds of thousands of words written on paper to digital text in minutes. A human being would require many tens of hours to input that data manually into a computer.
The downside of automated data entry, however, is that even the best tools make mistakes. Recognizing words within scanned images or audio files is just hard. Consider the following challenges:
- When converting text on paper to digital text, your OCR tools may get confused if a stain obscures part of the text. A human reading a piece of paper might be able to sort out the text even if it has coffee spilled on it, but an OCR tool might not because it is not designed to handle situations like that. As a result, some text is not properly digitized.
- In small print, characters like 0 and 8 can look similar. This confuses OCR tools.
- OCR and speech-to-text tools generally rely on dictionaries to help them determine what is a word. This works well when all the data they scan consists of common words in the language that the tools support. But when you have a proper name that is not in the dictionary file, text in a foreign language, a line of computer code or something else that is unexpected, the tools stand a much poorer chance of converting the information accurately to digital form.
- OCR works well with text written in plain fonts that the tools were designed to support. Good luck, however, scanning a document written in German Gothic characters. And as far as handwritten text goes, even the best, most advanced OCR tools stand very little chance of recognizing it correctly.
- Speech-to-text tools tend to do a poor job of figuring out which words are being spoken by which people, especially in cases where people are talking over each other. As a result, speech-to-text conversions may produce a jumble of words inside a text file but give you little idea of who said what.
- With automated data entry, you lack the types of metadata that you typically get with born-digital sources. For example, when you are working with data from a computer log file, you can look at file system metadata to determine when the log file was created and when it was last modified. This information adds context that can be useful when performing analytics. In contrast, analog data sources don’t usually have metadata associated with them. You generally can’t tell from looking at a piece of paper whether the words written on it were recorded yesterday or a decade ago, unless there is a date on it (and even if there is a date, you have no way of knowing for certain that it is correct).
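The dictionary problem described above also suggests a simple mitigation: a post-processing pass that flags any token that is neither a number nor a known word, so it can be routed for human review. Here is a minimal sketch of that idea in Python; the word list, function name, and sample text are all hypothetical, and a real system would use a much larger dictionary.

```python
# Minimal sketch: flag OCR output tokens that are not in a known-word
# dictionary, so a human (or downstream tool) can review them.
# The word list and sample text are invented for illustration.
KNOWN_WORDS = {"the", "invoice", "total", "amount", "due", "is", "dollars"}

def flag_suspect_tokens(ocr_text, dictionary):
    """Return tokens that are neither numbers nor dictionary words."""
    suspects = []
    for token in ocr_text.lower().split():
        cleaned = token.strip(".,;:!?")
        if cleaned and not cleaned.isdigit() and cleaned not in dictionary:
            suspects.append(cleaned)
    return suspects

# "t0tal" and "am0unt" simulate an OCR tool confusing the letter O with zero.
print(flag_suspect_tokens("The t0tal am0unt due is 50 dollars.", KNOWN_WORDS))
# → ['t0tal', 'am0unt']
```

Flagging, rather than auto-correcting, keeps the human in the loop for exactly the tokens the tool is least confident about.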
OCR errors like these are the reason why collections like Google Books, which relies heavily on OCR to make the words inside older books digitally searchable, contain so many misspelled words.
These types of errors lead to poor data quality when you are working with data sources that have been converted using automated data entry tools. Inaccuracies, missing data and other types of problems undercut the reliability of the data and cause analytics difficulties.
Improving Data Quality
How do you solve these data quality challenges?
One way, of course, is to have someone review all your automatically converted data by hand. But that takes almost as long as entering the data manually in the first place.
A variant of this approach is to rely on crowdsourcing data correction. Crowdsourcing means that you ask a large number of people – usually volunteers – each to review small pieces of your data and correct errors manually. This is what Google does through the reCAPTCHA program, for example.
This crowdsourcing strategy works if you’re an organization as large as Google, and if your data sources can be displayed to the public. Unfortunately, it’s less practical for everyone else.
If you can’t crowdsource your data quality improvement, you can always use a data quality tool to check and fix the work of your automated data entry tools. Data quality tools scan databases for misspellings, missing data, and inconsistencies, then automatically attempt to fix them.
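To make those checks concrete, here is a minimal sketch of the kind of rules such a tool automates: detecting missing fields and malformed values in digitized records. The field names, validation rules, and sample records are hypothetical, not taken from any specific product.

```python
# Sketch of automated quality checks on digitized records.
# Field names, rules, and sample data are illustrative only.
import re

def check_record(record):
    """Return a list of quality issues found in one record."""
    issues = []
    for field in ("name", "address", "zip"):
        if not record.get(field):
            issues.append(f"missing {field}")
    zip_code = record.get("zip", "")
    if zip_code and not re.fullmatch(r"\d{5}", zip_code):
        issues.append(f"malformed zip: {zip_code}")
    return issues

records = [
    # OCR misread the digit 5 as the letter S in the ZIP code.
    {"name": "John Smith", "address": "123 Main Street", "zip": "1230S"},
    {"name": "", "address": "9 Elm Ave", "zip": "90210"},
]
for r in records:
    print(check_record(r))
# → ['malformed zip: 1230S']
# → ['missing name']
```

Real data quality tools apply hundreds of such rules, but the principle is the same: encode what valid data looks like, then flag or repair anything that deviates.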
They also cross-check databases against each other to help identify information that may be wrong. This is a good way of catching, for instance, misspelled names within OCR’d data sources. If one database of OCR’d address information contains an entry for a Jon Smith living at 123 Main Street, but ten other databases based on the same source say that there is a John Smith at that address, a data quality tool will recognize the inconsistency and surmise that an OCR error caused the name to be misspelled in the first database.
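The cross-checking idea boils down to a majority vote across sources. The sketch below shows the principle in a few lines of Python; the names and sources are invented, and real tools add fuzzy matching and confidence scoring on top of this.

```python
# Sketch of cross-checking one field across several databases and
# trusting the majority spelling. Names and sources are invented.
from collections import Counter

def majority_value(values):
    """Pick the most common value across sources (ties go to the first seen)."""
    return Counter(values).most_common(1)[0][0]

# Five sources digitized the same record; one OCR pass dropped the "h".
sources = ["John Smith", "John Smith", "Jon Smith", "John Smith", "John Smith"]
print(majority_value(sources))
# → John Smith
```

The outvoted spelling isn’t necessarily wrong, of course – which is why tools typically surface such conflicts for review rather than silently overwriting them.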
Syncsort’s suite of Big Data solutions now includes data quality tools as well as data analytics and integration solutions. To learn more about how to take advantage of resources like these to streamline and optimize your data operations, check out the TDWI Report: Building a Data Lake Checklist.