Open Data is Great – But Only If You Ensure Data Quality
Open data is all around us these days, which is a great thing. To leverage open data effectively, however, you need to be prepared to address the data quality risks. Here’s why and how.
Defining Open Data
The open data movement takes its cues from the open source software movement.
Open source software refers to programs whose source code is available for the public to download, inspect, modify and, if desired, expand.
In a similar fashion, open data refers to data sets that anyone can access and use as they wish.
Open data is usually free of cost, although that is not the defining characteristic. Openness – that is, the quality of being openly accessible to anyone – is what makes open data what it is.
Open data sets come from a variety of sources, including governments and scientific research projects. The Human Genome Project, for example, makes a range of important data sets freely available.
Why Use Open Data
Simply put, open data is a great resource. Companies can and should take advantage of open databases when the data fit their needs. In many cases, doing so is a fast and cost-effective way to gain access to data that can drive analytics engines and deliver important insights.
For example, imagine that your company wants to know what kind of public Wi-Fi infrastructure is available to customers in order to predict how much bandwidth an app can rely on for those customers. If the customers live in New York City, the company can grab open data related to Wi-Fi availability for residents. That’s a lot faster and easier than compiling all that data from scratch.
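As a minimal sketch of what working with such a data set might look like, the snippet below parses a CSV export of Wi-Fi hotspot records into plain Python dicts. The column names (Borough, Location, Latitude, Longitude) are assumptions for illustration, not a confirmed schema of the actual NYC export.

```python
import csv
import io

# Hedged sketch: the column names below are assumed for illustration;
# a real NYC Open Data export may use a different schema.
SAMPLE_CSV = """Borough,Location,Latitude,Longitude
Manhattan,"Bryant Park",40.7536,-73.9832
Brooklyn,"Prospect Park",40.6602,-73.9690
"""

def load_hotspots(csv_text):
    """Parse a CSV export of Wi-Fi hotspots into a list of dicts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            "borough": row["Borough"],
            "location": row["Location"],
            "lat": float(row["Latitude"]),
            "lon": float(row["Longitude"]),
        }
        for row in reader
    ]

hotspots = load_hotspots(SAMPLE_CSV)
print(len(hotspots))  # 2
```

In practice you would download the export (or query the city's API) rather than embed the rows inline; the parsing step stays the same.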
Open Data and Data Quality
As great as open data is, it comes with a caveat. In some cases, it may not provide the data quality required to make the data actionable.
This isn’t because most open data sets are inherently low in quality. The fact that they are (usually) free does not mean you can’t trust the data inside them. This may be the case with some open databases, but most open projects provide data that is as reliable as any you collect yourself. (Indeed, you can make the argument that because open data sets are available for anyone to inspect, they are likely to have fewer errors, because there are more people to notice that something is wrong.)
Still, no data set is perfect, and open databases are no exception. Take the open database related to Wi-Fi in New York City. The database includes the street address for each Wi-Fi access point, along with latitude and longitude coordinates. If it is important for you to know for certain exactly where each Wi-Fi access point is located, you’d want to cross-check this information to make sure the street addresses align with the map coordinates.
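One way to perform that cross-check is to compare each record's stated coordinates against coordinates obtained independently for the same street address, flagging any record where the two disagree by more than some tolerance. The sketch below uses a hard-coded lookup table as a stand-in for a real geocoding service, and the addresses and coordinates are invented examples.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Stand-in for a real geocoder: independently verified coordinates per address.
# These addresses and coordinates are illustrative, not taken from the data set.
GEOCODED = {
    "476 5th Ave": (40.7536, -73.9822),
    "95 Prospect Park W": (40.6602, -73.9690),
}

def flag_mismatches(rows, tolerance_m=250):
    """Return addresses whose recorded coordinates sit more than
    tolerance_m meters from the independently geocoded position."""
    flagged = []
    for row in rows:
        ref = GEOCODED.get(row["address"])
        if ref is None:
            continue  # no reference position; cannot cross-check this row
        if haversine_m(row["lat"], row["lon"], ref[0], ref[1]) > tolerance_m:
            flagged.append(row["address"])
    return flagged

rows = [
    {"address": "476 5th Ave", "lat": 40.7536, "lon": -73.9822},        # agrees
    {"address": "95 Prospect Park W", "lat": 40.7602, "lon": -73.9690}, # ~11 km off
]
print(flag_mismatches(rows))  # ['95 Prospect Park W']
```

The tolerance is a judgment call: too tight and ordinary rounding differences between the source and the geocoder get flagged, too loose and genuine errors slip through.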
You’d also probably want to make sure that all the street addresses actually exist. Data entry errors, address changes or other problems could easily introduce flaws into this part of the database.
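A full existence check requires a geocoder or an authoritative address database, but a cheap first pass can catch obvious data entry errors: records with no house number, or no recognizable street suffix. The pattern below is a simplified sketch; a production rule set would be far more complete.

```python
import re

# Hedged sketch: a plausibility check, not a true existence check. It flags
# entries missing a house number or street suffix; well-formed but nonexistent
# addresses still require a geocoder to detect.
SUFFIXES = r"(St|Street|Ave|Avenue|Blvd|Boulevard|Rd|Road|Pl|Place|W|E|N|S)"
ADDRESS_RE = re.compile(r"^\d+[A-Za-z]?\s+.+\b" + SUFFIXES + r"\.?$", re.IGNORECASE)

def suspicious_addresses(addresses):
    """Return addresses that fail the basic house-number-plus-suffix pattern."""
    return [a for a in addresses if not ADDRESS_RE.match(a.strip())]

addrs = ["476 5th Ave", "Bryant Park", "95 Prospect Park W", "12 Main"]
print(suspicious_addresses(addrs))  # ['Bryant Park', '12 Main']
```

Anything this check flags is a candidate for review, not automatically wrong: legitimate entries (a park name, a plaza) may fail the pattern, which is exactly the kind of judgment a data quality workflow has to accommodate.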
Data quality tools – including Trillium data quality solutions, which are now part of Syncsort’s suite of Big Data solutions – can help you perform the checks you need to identify and fix potential data quality errors like these.