How to Clean Big Data: Data Quality Keeps Your Data Lakes from Turning into Data Swamps

How to Clean and Trust Your Big Data

You’ve probably witnessed it, and maybe you’re doing it yourself: many organizations are dumping as much data as they can into a data lake, pulling from every data source in the enterprise. We see it here at Syncsort, with vast amounts of data from the mainframe and other sources heading to the data lake for analytics and other use cases. But what if you can’t trust that data because it’s full of errors, duplicates (like customer records!), and generally just “dirty data”?

You need to clean it! Syncsort just announced Trillium Quality for Big Data to do just that. To get more insight into the challenges the new product tackles, let’s walk through a real-world data quality example: creating a single view of a customer, or of any entity, like a supplier or product.

Download our eBook: The New Rules for Your Data Landscape

Parsing and Standardization are First Steps to Clean Big Data

Cleaning Big Data and de-duplicating it down to a single view takes a series of data quality steps. To create a single view of a customer or product, for instance, we need everything in a standard format to get the best match. Let’s talk through the first two of these steps: parsing and standardization.

Let me use a simple example.

100 St. Mary St.

As humans, we know that’s an address, and that it reads “100 Saint Mary Street,” because we understand what each “St.” means from its position. Postal address formats also vary from country to country; in Germany, for example, the house number comes after the street name.
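
To make that concrete, here’s a minimal standardization sketch in Python. It’s purely illustrative, not Trillium’s parsing engine (which ships country-specific rules and dictionaries); the expansion tables and the position rule below are assumptions invented for the demo:

    import re

    # Toy expansion tables; a real tool uses country-specific dictionaries.
    PREFIX_EXPANSIONS = {"st": "Saint"}   # "St. Mary" -> "Saint Mary"
    SUFFIX_EXPANSIONS = {"st": "Street", "ave": "Avenue", "rd": "Road"}

    def standardize_street(address):
        tokens = re.sub(r"[.,]", "", address).split()
        out = []
        for i, tok in enumerate(tokens):
            key = tok.lower()
            if i == len(tokens) - 1 and key in SUFFIX_EXPANSIONS:
                out.append(SUFFIX_EXPANSIONS[key])    # last token: a street type
            elif 0 < i < len(tokens) - 1 and key in PREFIX_EXPANSIONS:
                out.append(PREFIX_EXPANSIONS[key])    # interior token: part of the name
            else:
                out.append(tok)
        return " ".join(out)

    print(standardize_street("100 St. Mary St."))     # -> 100 Saint Mary Street

The same token means two different things depending on where it sits, which is exactly why parsing has to happen before standardization.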

Now think about all the different formats for names, company names, addresses, product names, and other inventoried items such as books, toys, automobiles, computers, manufacturing parts, etc.

Now think about different languages.

Next Step to Draining the Swamp: Data Matching for the Best Single Records

Once we have all this data in a common, standard format, we can match. But even this can be complex. You can’t rely on customer IDs, for instance, to guarantee de-duplicated data. Think about how many different ways customers are represented in each of the source systems that have been feeding (and polluting) the data lake; matching is a hard problem.

Think about a name, Josh Rogers (I’ll pick on our CEO). The name could be in many different formats – or even misspelled – across your source systems and now in the data lake:

J. Rogers
Josh Rodgers
Joseph Togers
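
A toy way to see why this is hard: score each variant against the known name with a generic string-similarity measure. Here I’m using Python’s standard-library difflib purely for illustration; real matching engines (Trillium’s included) use much richer rules, phonetic keys, and nickname tables, and the 0.8 threshold is an arbitrary choice for the demo.

    from difflib import SequenceMatcher

    def name_similarity(a, b):
        # Generic character-level similarity in [0, 1]; a stand-in for real matching logic.
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    canonical = "Josh Rogers"
    for variant in ["J. Rogers", "Josh Rodgers", "Joseph Togers"]:
        score = name_similarity(canonical, variant)
        verdict = "likely match" if score >= 0.8 else "needs review"
        print(f"{variant}: {score:.2f} ({verdict})")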

If you use the right data quality tools to clean Big Data in your data lake, your data won’t reside in murky waters.

As a marketing analyst, I have a new product to promote, and I must make sure I’m targeting the right customer or prospect. If Josh lives in a small town in ZIP code 60451 (New Lenox, IL, my home town!), he’s probably the only person by that name on his street.

But if his ZIP code is 10023 (the Upper West Side of NYC), there might be more than one person with that name, even at the same address (think about the name Bob Smith!). Matching is a complex problem, especially at the data volumes found in a data lake.
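
One standard technique for taming those volumes is “blocking”: only compare records that already share a candidate key, such as ZIP code, so you never score every record against every other one. Here’s a sketch with invented records and the same toy similarity function as above (repeated so the snippet stands alone); none of this is Trillium’s actual matching logic:

    from collections import defaultdict
    from difflib import SequenceMatcher
    from itertools import combinations

    def name_similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    records = [
        {"id": 1, "name": "Josh Rogers",  "zip": "10023"},
        {"id": 2, "name": "J. Rogers",    "zip": "10023"},
        {"id": 3, "name": "Josh Rodgers", "zip": "60451"},
    ]

    # Group records by ZIP; only records within the same block get compared.
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec["zip"]].append(rec)

    for zip_code, recs in blocks.items():
        for a, b in combinations(recs, 2):
            if name_similarity(a["name"], b["name"]) >= 0.8:
                print(f"possible duplicate in ZIP {zip_code}: record {a['id']} ~ record {b['id']}")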

The last step is to commonize the matched records and “survive” the best fields from each, making up the best single record.
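
Here’s a simplified survivorship sketch to show the shape of that last step. The policy (take the most recently updated non-null value for each field) is just one possibility, the records are invented, and real tools let you configure survivorship rules per field:

    # Two matched records for the same customer; survive the best fields.
    cluster = [
        {"name": "Josh Rogers", "email": None,                  "phone": "555-0101", "updated": "2017-01-15"},
        {"name": "J. Rogers",   "email": "jrogers@example.com", "phone": None,       "updated": "2016-03-01"},
    ]

    def survive(cluster):
        golden = {}
        fields = sorted({k for rec in cluster for k in rec if k != "updated"})
        for field in fields:
            candidates = [r for r in cluster if r.get(field) is not None]
            if candidates:
                # Policy: prefer the value from the most recently updated record.
                golden[field] = max(candidates, key=lambda r: r["updated"])[field]
        return golden

    print(survive(cluster))
    # -> {'email': 'jrogers@example.com', 'name': 'Josh Rogers', 'phone': '555-0101'}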

Now, Let’s Run This in Big Data

Creating the single best, enriched record is exactly what Trillium Quality for Big Data does. The product lets the user create and test the steps above locally, then leverage Syncsort’s Intelligent Execution technology to execute them in Big Data frameworks such as Hadoop MapReduce or Spark. The user doesn’t need to know these frameworks, and the design is future-proofed for new ones, which we all know are coming.
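
To make “design once, run anywhere” concrete, here’s roughly what a hand-rolled PySpark version of the earlier street-type rule could look like. This is my sketch of the general pattern, not code that Trillium generates (Intelligent Execution does this translation for you), and the column names are assumptions:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    SUFFIXES = {"st": "Street", "ave": "Avenue", "rd": "Road"}

    @F.udf(StringType())
    def standardize_street(address):
        # Trailing street-type expansion only, to keep the sketch short.
        tokens = (address or "").replace(".", "").split()
        if tokens and tokens[-1].lower() in SUFFIXES:
            tokens[-1] = SUFFIXES[tokens[-1].lower()]
        return " ".join(tokens)

    spark = SparkSession.builder.appName("clean-big-data").getOrCreate()
    df = spark.createDataFrame([("100 St. Mary St.",)], ["address"])
    df.withColumn("address_std", standardize_street("address")).show(truncate=False)

The point of the product, of course, is that you design and test the rules once and never have to write this translation layer yourself.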

So, what makes Trillium Quality for Big Data different?

  • The product has more matching capabilities than any other technology, ensuring you get that single view
  • For those postal addresses, we have worldwide postal coverage for address validation and geocoding (latitude/longitude)
  • Performance and scalability through Intelligent Execution, for execution in Big Data environments on large and growing volumes of data

Now it’s time to go clean the data swamp and make it a trusted, single view data lake!

Discover how today’s new data supply chain impacts how data is moved, manipulated, and cleansed – download our eBook The New Rules for Your Data Landscape today!


Authored by Keith Kohl

Vice President, Product Management