Data Quality 101: What is Data Quality?


If you work with data, you’ve probably heard the term data quality more than a few times. But do you know what data quality actually means, and what data quality analysts do? If not, this article’s for you.

Data quality may not be quite as popular a buzzword as Big Data, but it’s an oft-used term in the data world. Data analysts like to remind everyone that data quality is essential for deriving value from data.

But they don’t always take the time to define data quality or provide real-world examples of the types of problems that data quality tools correct. So, let’s take a look.

What is Data Quality? A Data Quality Definition

A basic data quality definition is this: Data quality is the ability of a given data set to serve an intended purpose.

To put it another way, if you have data quality, your data is capable of delivering the insight you hope to get out of it. Conversely, if you don’t have data quality, there is a problem in your data that will prevent you from achieving what you hope to do with it.

Data Quality Examples

To illustrate the definition further, let’s examine a few examples of real-world data quality challenges.

Imagine that we have a data set that consists of names and addresses. Data like this is likely to contain some errors for various reasons – both simple and complicated ones.

The simple causes include data-entry mistakes in names and addresses, and address information that has changed since it was collected.


There are other, more complicated problems that may exist in the data set. One is entries that are ambiguous because of incomplete information. For example, one entry might be an address for a Mr. Smith who lives in the city “London,” with no country specified. This is a problem because we don’t know whether the London in which Mr. Smith resides is London, England, London, Ontario or one of the other dozen-or-so cities around the world named London. Unless you use a data quality tool to correct this ambiguity, you’ll face difficulty using your data set to reach Mr. Smith.
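
To make the ambiguity concrete, here is a minimal Python sketch (not any particular product’s logic) that flags records whose city name could belong to more than one country when no country is recorded. The record layout and the small CITY_COUNTRIES reference table are assumptions made for this example.

```python
# Minimal sketch: flag address records that are ambiguous because the city
# name alone could belong to more than one country. The record layout and
# the small reference table are illustrative assumptions, not a real data set.

CITY_COUNTRIES = {
    "london": ["United Kingdom", "Canada", "United States"],
    "springfield": ["United States"],
}

def is_ambiguous(record: dict) -> bool:
    """Return True when no country is given and the city name matches
    more than one country in the reference table."""
    if record.get("country"):
        return False  # country already specified, nothing to disambiguate
    candidates = CITY_COUNTRIES.get(record.get("city", "").strip().lower(), [])
    return len(candidates) > 1

records = [
    {"name": "Mr. Smith", "city": "London", "country": None},
    {"name": "Ms. Jones", "city": "Springfield", "country": None},
]

for r in records:
    if is_ambiguous(r):
        print(f"Needs review: {r['name']} in {r['city']} (country unknown)")
```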

As another example of a complex data quality problem, consider the issue of seemingly redundant addresses within the data set. Let’s say we have multiple entries in our database for people named Mr. Smith who reside at 123 Main Street. This could be the result of a simple double-entry: Perhaps the data for Mr. Smith was entered more than once by mistake.

Another possibility is that there are multiple Misters Smith – a father and son, perhaps – residing at the same address. Or maybe we are dealing with entries for totally unrelated men who both happen to have the same last name and reside at 123 Main Street, but in different towns. Without data quality correction, there’s too much ambiguity in a data set like this to be able to rely on the data for marketing or customer-relations purposes.
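
The sketch below shows one crude way such records could be grouped for review, assuming a simple list-of-dictionaries data set and a hand-rolled match key rather than any vendor’s matching engine. The field names and the rough street normalization are assumptions made for the example.

```python
from collections import defaultdict

# Minimal sketch: group records that share a roughly normalized last name and
# street so they can be reviewed as possible duplicates. The field names and
# the crude normalization are illustrative assumptions.

def normalize_street(street: str) -> str:
    """Very rough normalization so '123 Main St.' and '123 Main Street' compare equal."""
    return street.strip().lower().replace("street", "st").replace("st.", "st")

def match_key(record: dict) -> tuple:
    """Build a simple match key from last name and normalized street address."""
    return (record["last_name"].strip().lower(), normalize_street(record["street"]))

records = [
    {"last_name": "Smith", "street": "123 Main Street", "city": "Springfield"},
    {"last_name": "Smith", "street": "123 Main St.", "city": "Springfield"},
    {"last_name": "Smith", "street": "123 Main Street", "city": "Shelbyville"},
]

groups = defaultdict(list)
for r in records:
    groups[match_key(r)].append(r)

for key, members in groups.items():
    if len(members) > 1:
        # These may be one person entered twice, relatives at one address,
        # or unrelated people in different towns -- the grouping only flags
        # them; deciding what to merge still needs rules or human review.
        print(f"Possible duplicates for {key}: {len(members)} records")
```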

Related: Data Quality – A Review of Use Cases & Trends

Fixing Data Quality Problems

One way to correct data quality issues like these is to research each inconsistency or ambiguity and fix it manually. That would take a huge amount of time, however. It’s not practical on a large scale.

A much more time- and cost-efficient approach is to use automated data quality tools that can identify, interpret and correct data problems without human guidance. In the case of a data set composed of names and addresses, they might do this by correlating the data with other data sets to catch errors, or using predictive analytics to fill in the blanks.
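
As a rough illustration of what “correlating the data with other data sets” can look like, here is a hedged Python sketch that fills in Mr. Smith’s missing country by matching his postal code against a tiny reference table. The table, field names and prefix rules are assumptions made for the example, not the behavior of any particular tool.

```python
# Minimal sketch: fill in a missing country by cross-referencing the postal
# code against a small reference data set. The reference rules, field names,
# and prefix matching are illustrative assumptions, not real product behavior.

POSTAL_CODE_RULES = [
    ("SW1A", "United Kingdom"),  # a central London, England postcode district
    ("N6A", "Canada"),           # a London, Ontario forward sortation area
]

def infer_country(record: dict) -> dict:
    """Return a copy of the record with 'country' filled in when a reference
    rule matches; otherwise leave it blank for human review."""
    enriched = dict(record)
    if not enriched.get("country"):
        code = (enriched.get("postal_code") or "").upper()
        for prefix, country in POSTAL_CODE_RULES:
            if code.startswith(prefix):
                enriched["country"] = country
                break
    return enriched

record = {"name": "Mr. Smith", "city": "London", "postal_code": "SW1A 1AA", "country": None}
print(infer_country(record))
# -> {'name': 'Mr. Smith', 'city': 'London', 'postal_code': 'SW1A 1AA', 'country': 'United Kingdom'}
```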


The Never-Ending Data Quality Battle

Because data quality is defined in terms of a data set’s ability to serve a given task, the precise nature and characteristics of data quality will vary from case to case. What one organization perceives as high-quality data could be rubbish in the eyes of another organization.

Understanding how data quality changes based on context is important because it means that data quality is not something you can simply obtain and keep. You may have data quality today but lose it tomorrow if your goals change and your data in its current state can no longer meet them.

So, think of data quality as a never-ending battle. It’s something you need to be constantly working on and improving to ensure that your data is ready to meet whatever tasks you throw at it.

Using Syncsort to Achieve Data Quality in the Data Lake

As organizations liberate data from traditional silos across the enterprise and centralize it in data lakes for high-powered analytics, data governance is becoming a top priority, especially in highly regulated industries such as banking, insurance, financial services and healthcare. Syncsort combines its DMX-h high-performance Big Data integration software, which quickly and efficiently accesses data from any source and loads it into the data lake, with data quality tools that profile that data.

Download the Gartner Magic Quadrant Report to learn how leading solutions can help you achieve your long-term data quality objectives.

Authored by Christopher Tozzi

Christopher Tozzi has written about emerging technologies for a decade. His latest book, For Fun and Profit: A History of the Free and Open Source Software Revolution, is forthcoming with MIT Press in July 2017.
