If you work with data, you’ve probably heard the term data quality more than a few times. But do you know what data quality actually means, and what data quality analysts do? If not, this article’s for you.
Data quality may not be quite as popular a buzzword as Big Data, but it’s an oft-used term in the data world. Data analysts like to remind everyone that data quality is essential to deriving value from data.
But they don’t always take the time to define data quality or provide real-world examples of the types of problems that data quality tools correct.
If you’re wondering what data quality is, you’ve come to the right place. So, let’s take a look.
What is Data Quality? A Data Quality Definition
A basic data quality definition is this: Data quality is the ability of a given data set to serve an intended purpose.
To put it another way, if you have data quality, your data is capable of delivering the insight you hope to get out of it. Conversely, if you don’t have data quality, there is a problem in your data that will prevent you from using the data to do what you hope to achieve with it.
Data Quality Examples
To illustrate the definition further, let’s examine a few examples of real-world data quality challenges.
Imagine that we have a data set that consists of names and addresses. Data like this is likely to contain some errors for various reasons – both simple and complicated ones.
Simple causes of data errors include names and addresses that were entered incorrectly, and address information that has changed since it was collected.
There are other, more complicated problems that may exist in the data set. One is entries that are ambiguous because of incomplete information. For example, one entry might be an address for a Mr. Smith who lives in the city “London,” with no country specified. This is a problem because we don’t know whether the London in which Mr. Smith resides is London, England; London, Ontario; or one of the other dozen-or-so cities around the world named London. Unless you use a data quality tool to correct this ambiguity, you’ll have difficulty using your data set to reach Mr. Smith.
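To make the “London” problem concrete, here is a minimal Python sketch of how such records might be flagged. The `AMBIGUOUS_CITIES` lookup, the record layout and the function name are all assumptions made for illustration; a real data quality tool would rely on a far more complete gazetteer.

```python
# Hypothetical, deliberately tiny lookup of city names that exist in
# more than one country. A real tool would use a full gazetteer.
AMBIGUOUS_CITIES = {
    "london": ["United Kingdom", "Canada", "United States"],
}


def find_ambiguous_records(records):
    """Return records whose city name is ambiguous and whose country is missing."""
    flagged = []
    for record in records:
        city = (record.get("city") or "").strip().lower()
        country = (record.get("country") or "").strip()
        if city in AMBIGUOUS_CITIES and not country:
            flagged.append(record)
    return flagged


# Made-up sample records for illustration only.
records = [
    {"name": "Mr. Smith", "city": "London", "country": ""},
    {"name": "Ms. Jones", "city": "London", "country": "Canada"},
    {"name": "Mr. Lee", "city": "Reykjavik", "country": None},
]

for record in find_ambiguous_records(records):
    candidates = AMBIGUOUS_CITIES[record["city"].strip().lower()]
    print(record["name"], "could be in:", ", ".join(candidates))
```

Note that the sketch only detects the ambiguity. Resolving it still requires extra information, such as a postal code or a phone prefix.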
As another example of a complex data quality problem, consider the issue of seemingly redundant addresses within the data set. Let’s say we have multiple entries in our database for people named Mr. Smith who reside at 123 Main Street. This could be the result of a simple double-entry: Perhaps the data for Mr. Smith was entered more than once by mistake. Another possibility is that there are multiple Misters Smith – a father and son, perhaps – residing at the same address. Or maybe we are dealing with entries for totally unrelated men who both happen to have the same last name and reside at 123 Main Street, but in different towns. Without data quality correction, there’s too much ambiguity in a data set like this to be able to rely on the data for marketing or customer-relations purposes.
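The three scenarios above (a double entry, two people at one address, and unrelated people in different towns) can be told apart, at least roughly, by grouping records and comparing the fields that differ. The following Python sketch assumes a hypothetical record layout with `first_name`, `last_name`, `street` and `town` fields, and it assigns one coarse label per group, so treat it as an illustration rather than a production matching algorithm.

```python
from collections import defaultdict


def normalize(value):
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return " ".join(value.lower().split())


def classify_duplicates(records):
    """Group records by surname + street and give each group a coarse label."""
    groups = defaultdict(list)
    for record in records:
        key = (normalize(record["last_name"]), normalize(record["street"]))
        groups[key].append(record)

    report = {}
    for key, members in groups.items():
        if len(members) < 2:
            continue  # a single record is not a candidate duplicate
        towns = {normalize(m["town"]) for m in members}
        first_names = {normalize(m["first_name"]) for m in members}
        if len(towns) > 1:
            report[key] = "different towns"
        elif len(first_names) > 1:
            report[key] = "same address, different people"
        else:
            report[key] = "likely double entry"
    return report


# Made-up sample records for illustration only.
records = [
    {"first_name": "John", "last_name": "Smith", "street": "123 Main Street", "town": "Springfield"},
    {"first_name": "john", "last_name": "Smith", "street": "123  Main street", "town": "Springfield"},
    {"first_name": "Adam", "last_name": "Brown", "street": "9 Oak Avenue", "town": "Dover"},
    {"first_name": "Alan", "last_name": "Brown", "street": "9 Oak Avenue", "town": "Dover"},
    {"first_name": "Paul", "last_name": "Jones", "street": "5 High Street", "town": "Leeds"},
    {"first_name": "Paul", "last_name": "Jones", "street": "5 High Street", "town": "York"},
    {"first_name": "Eve", "last_name": "Stone", "street": "1 Hill Road", "town": "Bath"},
]

for key, label in classify_duplicates(records).items():
    print(key, "->", label)
```

Only the “likely double entry” group is a safe candidate for automatic merging. The other two labels describe records that look redundant but probably are not.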
Fixing Data Quality Problems
One way to correct data quality issues like these is to research each inconsistency or ambiguity and fix it manually. That would take a huge amount of time, however. It’s not practical on a large scale.
A much more time- and cost-efficient approach is to use automated data quality tools that can identify, interpret and correct data problems without human guidance. In the case of a data set composed of names and addresses, they might do this by correlating the data with other data sets to catch errors, or by using predictive analytics to fill in the blanks.
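As a rough illustration of what “correlating the data with other data sets” can look like, the Python sketch below fills in a missing country by looking up each record’s postal code in a reference table. The `POSTAL_REFERENCE` table, its sample values, the field names and the function name are assumptions made for this example; they do not describe how any particular data quality product works.

```python
# Hypothetical reference data set mapping postal codes to countries.
# The entries are sample values; a real tool would use an authoritative
# postal database.
POSTAL_REFERENCE = {
    "SW1A 1AA": "United Kingdom",
    "N6A 3K7": "Canada",
}


def fill_missing_country(records, reference):
    """Fill in missing countries from the reference data set.

    Returns (corrected, unresolved): records that now have a country,
    and records that still need attention.
    """
    corrected, unresolved = [], []
    for record in records:
        if record.get("country"):
            corrected.append(dict(record))  # already complete, copy as-is
            continue
        postal_code = (record.get("postal_code") or "").strip().upper()
        country = reference.get(postal_code)
        if country:
            corrected.append({**record, "country": country})
        else:
            unresolved.append(record)
    return corrected, unresolved


# Made-up sample records for illustration only.
records = [
    {"name": "Mr. Smith", "city": "London", "postal_code": "sw1a 1aa", "country": ""},
    {"name": "Ms. Jones", "city": "London", "postal_code": "N6A 3K7", "country": "Canada"},
    {"name": "Mr. Brown", "city": "London", "postal_code": "", "country": ""},
]

corrected, unresolved = fill_missing_country(records, POSTAL_REFERENCE)
print("corrected:", [(r["name"], r["country"]) for r in corrected])
print("unresolved:", [r["name"] for r in unresolved])
```

Records that the reference data cannot resolve are returned separately, so they can be handled by another rule or reviewed by hand instead of being silently guessed.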
The Never-Ending Data Quality Battle
Because data quality is defined in terms of a data set’s ability to serve a given task, the precise nature and characteristics of data quality will vary from case to case. What one organization perceives as high-quality data could be rubbish in the eyes of another organization.
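One way to picture this is to run the same data set through two different purpose-specific checks. In the Python sketch below, the field names, the sample records and the 95 percent completeness threshold are all hypothetical; the point is only that one data set can pass for one purpose and fail for another.

```python
def fit_for_purpose(records, required_fields, min_complete_ratio):
    """Return True if enough records have every field the purpose requires."""
    if not records:
        return False
    complete = sum(
        1 for record in records
        if all(record.get(field) for field in required_fields)
    )
    return complete / len(records) >= min_complete_ratio


# Made-up sample records for illustration only.
customers = [
    {"name": "A. Smith", "email": "a.smith@example.com", "city": "London", "country": ""},
    {"name": "B. Jones", "email": "b.jones@example.com", "city": "Leeds", "country": "United Kingdom"},
    {"name": "C. Brown", "email": "c.brown@example.com", "city": "", "country": ""},
    {"name": "D. Lee", "email": "d.lee@example.com", "city": "Dover", "country": "United Kingdom"},
]

# Purpose 1: an email campaign only needs a name and an email address.
email_ok = fit_for_purpose(customers, ["name", "email"], 0.95)

# Purpose 2: a postal mailing also needs a city and a country.
postal_ok = fit_for_purpose(customers, ["name", "city", "country"], 0.95)

print("fit for email campaign:", email_ok)
print("fit for postal mailing:", postal_ok)
```

With these sample records, every customer has an email address but only half have a complete city and country, so the data is good enough for the first purpose and not for the second.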
Understanding how data quality changes based on context is important because it means that data quality is not something you can simply obtain and keep. You may have data quality today but lose it tomorrow if your goals change and your data in its current state can no longer meet them.
So, think of data quality as a never-ending battle. It’s something you need to work on and improve constantly to ensure that your data is ready for whatever tasks you throw at it.
Using Syncsort to Achieve Data Quality in the Data Lake
As organizations liberate data from traditional silos across the enterprise and centralize it in data lakes for high-powered analytics, data governance is becoming a top priority, especially in highly regulated industries such as banking, insurance, financial services and healthcare. Syncsort has combined Syncsort DMX-h, its high-performance Big Data integration software for quickly and efficiently accessing data from any source and loading it into the data lake, with Trillium Discovery for profiling that data.
Customers can now understand the quality of the data in Hadoop or Spark, so they can be confident that the data meets their governance criteria, or identify where they need to take action to clean it up. Read the report on Building a Data Lake to learn more.