How the Cloud Complicates Data Quality (and How You Can Fix It)
By now, you’ve heard all about the advantages of cloud computing. But the cloud also has its downsides. Among them are the special challenges for data quality that arise when data and data applications move to the cloud.
This does not mean that you shouldn’t use the cloud for storing and processing data. It does mean, however, that you need to take special care to manage data quality in cloud environments. Let’s explore how.
The Benefits of Cloud-Based Data Management
Lest this article appear to have an anti-cloud bent, let me make clear that the cloud is by no means a bad solution for data management.
When you move data and data analytics to the cloud, you get lots of benefits. The most obvious is scalability, or the ability to quickly increase or decrease the amount of infrastructure available for hosting and processing data.
Other benefits of cloud-based data management include easy-to-deploy data analytics tools, because you can take advantage of tools that your cloud provider offers as a service. Managing data in the cloud can also help you to avoid network bottlenecks. If your data originates in the cloud and you store and process it in the cloud, you don’t have to worry about delays while you wait to move data over the Internet to an on-premise environment.
Data Quality Drawbacks in the Cloud
On the other hand, when data management takes place in the cloud, data quality can suffer – if you don’t take steps to address it. That is true for a number of reasons:
- In the cloud, you often have little control over how the tools you rely on collect and process your data. You have to use whichever tools your cloud vendor provides, and your ability to tweak the way those tools work is typically limited. If you run Hadoop on-premise, for example, you can configure it and modify it to your heart’s content. But in the cloud, you’re stuck with whichever Hadoop-as-a-Service solution your cloud vendor offers. This can create data quality challenges because it limits your ability to transform and standardize data in ways that make data sets consistent and predictable.
- When you move data within the cloud, or between the cloud and on-premise infrastructure (if you choose to do that), you run the risk of formatting problems, data loss, inaccurate timestamps and other issues that undercut data quality. For example, if you move block data from a virtual server disk into a cloud-based file-storage service, formatting differences could cause data quality problems. Or data could be damaged while being transferred over the network.
- Cloud data can become very big, fast. The fact that the cloud is so scalable makes it easy to store huge volumes of information in the cloud. The more data you have, the harder it can be to maintain data quality.
- Cloud services are always changing and being updated – and unlike software that you set up and manage yourself on-premise, cloud-based tools may not always notify you when they are modified. Changes to your cloud-based tools can cause data quality issues if, for example, a tool modifies the way it structures data and your other tools are not configured to handle the new format.
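To make the data-transfer risk above more concrete, here is a minimal sketch of one way to confirm that a file arrived undamaged: compare the size and SHA-256 checksum of the copy against the original. It uses only the Python standard library. The function names are made up for illustration, and the sketch assumes both copies can be read as local files; many cloud storage services expose checksums through their own APIs, which would be the more practical route for large objects.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_is_intact(source: Path, destination: Path) -> bool:
    """Return True when both copies have the same size and checksum."""
    # Cheap check first: differing sizes mean the copy cannot be identical.
    if source.stat().st_size != destination.stat().st_size:
        return False
    return sha256_of_file(source) == sha256_of_file(destination)
```

A check like this only tells you that the bytes match. It will not catch formatting differences introduced deliberately by a conversion step, so it complements content-level checks rather than replacing them.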
Solutions: Maximizing Data Quality in the Cloud
So, what’s a forward-thinking data management team to do? Avoiding the cloud entirely is not the answer; that would put your organization at a disadvantage by denying it the benefits of the cloud.
Instead, you want to be sure that, when you take advantage of the cloud to assist in data management, you put data quality measures into place at the same time.
The most obvious and most fundamental way of doing this is to run automated data quality checks on all of your data, whether it is based in the cloud or not. Those checks should be a permanent part of your pipeline, not a one-time exercise.
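What might an automated check look like in practice? The sketch below is one simple possibility, not a substitute for a dedicated data quality tool. It validates each record against an expected set of fields and types, flags fields that were not expected (a common symptom of a cloud tool quietly changing its output format), and confirms that timestamps parse. The schema, the field names and the function names are all hypothetical and chosen only for illustration.

```python
from datetime import datetime

# Hypothetical expected schema: field name -> required Python type.
EXPECTED_SCHEMA = {"id": int, "email": str, "created_at": str}


def check_record(record: dict) -> list[str]:
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record or record[field] is None:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}")
    # Fields outside the schema often signal an upstream format change.
    for field in sorted(set(record) - set(EXPECTED_SCHEMA)):
        problems.append(f"unexpected field: {field}")
    # Timestamps must parse as ISO 8601 to be trusted downstream.
    created = record.get("created_at")
    if isinstance(created, str):
        try:
            datetime.fromisoformat(created)
        except ValueError:
            problems.append("created_at is not an ISO 8601 timestamp")
    return problems


def check_batch(records: list[dict]) -> dict[int, list[str]]:
    """Map the index of each failing record to its list of problems."""
    report = {}
    for index, record in enumerate(records):
        problems = check_record(record)
        if problems:
            report[index] = problems
    return report
```

Running `check_batch` on every batch that arrives from a cloud service, and alerting when the report is not empty, is one way to notice a changed format before it spreads to your other tools.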
At the same time, taking steps to minimize the number of data migrations between different services, or between the cloud and on-premise, can also improve data quality. So can a policy for archiving or deleting data from the cloud when you no longer need it, in order to avoid having your data sets grow too large and unwieldy.
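An archiving policy can start out very simple. As a rough sketch, the function below splits a list of stored objects into those to keep and those old enough to archive or delete, based on a maximum age in days. The function name, the `(name, last_modified)` input shape and the 90-day figure in the usage example are assumptions made for illustration; in a real environment you would pull the object listing from your storage service, and many services also offer built-in lifecycle rules that do this for you.

```python
from datetime import datetime, timedelta, timezone


def split_by_retention(objects, max_age_days, now=None):
    """Split (name, last_modified) pairs into keep and archive lists.

    Timestamps are expected to be timezone-aware datetimes.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    keep, archive = [], []
    for name, last_modified in objects:
        # Anything last modified before the cutoff is an archive candidate.
        (archive if last_modified < cutoff else keep).append(name)
    return keep, archive


# Usage example with made-up object names and a 90-day retention window.
listing = [
    ("sales-2024-05.csv", datetime(2024, 5, 20, tzinfo=timezone.utc)),
    ("sales-2023-01.csv", datetime(2023, 1, 1, tzinfo=timezone.utc)),
]
keep, archive = split_by_retention(
    listing, max_age_days=90, now=datetime(2024, 6, 1, tzinfo=timezone.utc)
)
```

Age is only one possible criterion. A policy based on when data was last read, or on whether it has been superseded, may fit your data better.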
Finally, remember that you don’t need to use all of your cloud vendor’s data management and analytics tools if you don’t want to. You can always take advantage of the cloud for data management in some ways, while still performing other tasks on-premise – or in your own custom cloud-based environment. You could, for example, set up your own Hadoop environment, using a distribution of your choice, in the cloud, rather than adopting the Hadoop-as-a-Service that the cloud vendor supplies.
The bottom line: It’s possible to enjoy the benefits of the cloud and ensure data quality at the same time. But it won’t happen without the right processes in place.
For more information on achieving high quality data in the cloud, read TDWI Checklist Report: Cloud Data-Quality Tool Considerations now!