The Data Transformation Process Explained in Four Steps
These days, understanding the steps involved in data transformation is important for lots of folks, even if data transformation is not a primary part of their job. Because we live in a world where data is collected, stored and analyzed in so many different formats, being able to perform the basic steps required to transform data from one form to another is a common requirement for many of us.
This article explains what those steps are by outlining a typical data transformation process. While the exact nature of data transformation will vary from situation to situation, the steps below are the most common parts of the data transformation process.
Step one: Data interpretation
The first step in data transformation is interpreting your data to determine which type of data you currently have, and what you need to transform it into.
Data interpretation can be harder than it looks. As a simple example, consider the fact that many operating systems and applications make assumptions about how data is formatted based on the extension that is appended to a file name. Thus, your computer is likely to assume that a file named video.avi is a video file, or that text.doc is a Microsoft Word file.
The problem with these labels is that the actual data inside a given file (or a directory or database) could be very different from what the file name suggests. Users can add whichever extensions they want to a file name; changing the extension doesn’t actually transform the data.
For this reason, interpreting data accurately requires tools that can peer deeper inside the structure of a file or database to see what is really inside, instead of what a file name or database table name suggests is inside. Tools like the Linux command-line utility file, which examines a file's contents rather than trusting its name, are useful for this purpose.
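To make the idea of looking inside a file concrete, here is a minimal Python sketch that guesses a format from the first few bytes of the data instead of from the extension. The SIGNATURES table and the function names sniff_bytes and sniff_file are made up for this illustration, and the table covers only a handful of well-known signatures; a real tool such as file knows many more.

```python
# Illustrative subset of well-known file signatures ("magic numbers").
# This table is an assumption of the sketch, not a complete catalog.
SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"PK\x03\x04", "ZIP archive (also used by .docx and .xlsx)"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (legacy .doc, .xls)"),
]


def sniff_bytes(header: bytes) -> str:
    """Guess a format from the leading bytes of some data."""
    # AVI files are RIFF containers: "RIFF", a 4-byte size, then "AVI ".
    if header[:4] == b"RIFF" and header[8:12] == b"AVI ":
        return "AVI video"
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first 16 bytes of a file and guess its format."""
    with open(path, "rb") as handle:
        return sniff_bytes(handle.read(16))
```

With something like this, a file named video.avi that actually starts with %PDF- would be reported as a PDF document, whatever its name claims.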
You also, of course, need to determine the target format—in other words, the format that your data should have after transformation is complete. If you do not already know that format, you’ll want to read the documentation for the tool or system that will receive your transformed data in order to determine which formats it supports or expects.
Step two: Pre-translation data quality check
Once you (or your data transformation tool) have figured out which data formats you are working with and which formats you will transform the data into, you should run a data quality check on the source data. A data quality check allows you to identify problems, such as missing or corrupt values within a database, that could cause trouble during later steps of the data transformation process.
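As a rough sketch of what such a check might look like, the Python example below scans CSV text for empty fields and for non-numeric values in columns that are supposed to hold numbers. The function name find_quality_issues and the numeric_columns parameter are hypothetical; which checks matter in practice depends entirely on your data.

```python
import csv
import io


def find_quality_issues(csv_text, numeric_columns=()):
    """Return (row_number, column, problem) tuples for suspicious fields."""
    issues = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # The header is row 1, so the first data row is row 2.
    for row_number, row in enumerate(reader, start=2):
        for column, value in row.items():
            if column is None:
                # DictReader stores surplus fields under the key None.
                issues.append((row_number, "<extra>", "unexpected extra field"))
            elif value is None or value.strip() == "":
                issues.append((row_number, column, "missing value"))
            elif column in numeric_columns:
                try:
                    float(value)
                except ValueError:
                    issues.append((row_number, column, "not a number"))
    return issues


sample = "name,age\nAda,36\nBob,\nCy,abc\n"
for issue in find_quality_issues(sample, numeric_columns=("age",)):
    print(issue)
```

Running the check before translation means problems are caught while they are still easy to trace back to a specific row in the source.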
Step three: Data translation
After you have resolved the quality problems in your source data, you can begin the process of actually translating data. Data translation means taking each part of your source data and replacing it with data that fits within the formatting requirements of your target data format.
For example, you may be transforming an old HTML file that was written using an outdated HTML standard into HTML5, the standard that modern Web browsers expect. Part of the data translation process, in this case, would involve replacing deprecated HTML tags, such as <dir> (a tag that was used in old versions of HTML to create directory lists), with <ul> (the unordered list tag supported by modern HTML).
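The tag replacement described above can be sketched in a few lines of Python. This is a deliberately simple illustration that uses a regular expression; the function name replace_dir_tags is made up here, and a production converter would more likely use a real HTML parser, since regular expressions can be fooled by comments, scripts and unusual markup.

```python
import re

# Matches <dir>, <dir ...> and </dir> in any letter case, but not other
# tags that merely start with "dir" (the lookahead requires whitespace,
# "/" or ">" right after the tag name).
DIR_TAG = re.compile(r"<(/?)dir(?=[\s/>])([^>]*)>", re.IGNORECASE)


def replace_dir_tags(html: str) -> str:
    """Rewrite deprecated <dir> tags as <ul>, keeping any attributes."""
    return DIR_TAG.sub(r"<\1ul\2>", html)


old_markup = "<DIR><li>alpha</li><li>beta</li></DIR>"
print(replace_dir_tags(old_markup))
```

Note that this touches only one tag. A full migration to HTML5 would repeat the same kind of substitution for every deprecated element and attribute in the file.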
Data translation often entails not just replacing individual pieces of data with other pieces, but also restructuring the overall file in a significant way. For example, a CSV file that is formatted as rows of comma-separated values would require considerable restructuring to convert into an XML file, which organizes information using nested hierarchies of tags.
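Here is a minimal Python sketch of that kind of restructuring, using only the standard library. The XML layout it produces (a records root, one record element per row, and one field element per column with the column name in a name attribute) is an assumption made for this example; your target system's documentation would dictate the real layout.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_xml(csv_text, root_tag="records", row_tag="record"):
    """Convert CSV text with a header row into a simple XML string."""
    root = ET.Element(root_tag)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if None in row:
            # DictReader files surplus fields under the key None.
            raise ValueError("row has more fields than the header")
        record = ET.SubElement(root, row_tag)
        for column, value in row.items():
            # Column names go in an attribute because they are not
            # guaranteed to be valid XML tag names.
            field = ET.SubElement(record, "field", name=column)
            field.text = value
    return ET.tostring(root, encoding="unicode")


print(csv_to_xml("name,city\nAda,London\nGrace,New York\n"))
```

The flat rows of the CSV become nested elements, which is exactly the structural change that simple find-and-replace cannot perform.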
Step four: Post-translation data quality check
In order to ensure that your translated data will be as useful as possible, you will want to perform a second data quality check once translation is complete. In this step of the process, you look for inconsistencies, missing information or other errors that may have been introduced during the data translation process. Even if your data was error-free before translation, there is a decent chance that problems will have been introduced during translation.
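One simple form of post-translation check is to compare the translated output against the source, record by record. The Python sketch below does this for a CSV source and an XML result. It assumes a hypothetical XML layout in which each row is a record element containing field elements with a name attribute; the function name verify_translation is likewise invented for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET


def verify_translation(csv_text, xml_text):
    """Return a list of problems found when comparing XML to its CSV source."""
    source_rows = list(csv.DictReader(io.StringIO(csv_text)))
    records = ET.fromstring(xml_text).findall("record")
    problems = []
    if len(source_rows) != len(records):
        problems.append(
            f"row count mismatch: {len(source_rows)} source "
            f"vs {len(records)} translated"
        )
    # Compare the rows that exist on both sides, treating None as "".
    for index, (row, record) in enumerate(zip(source_rows, records), start=1):
        expected = {key: (value or "") for key, value in row.items()}
        actual = {f.get("name"): (f.text or "") for f in record.findall("field")}
        if actual != expected:
            problems.append(f"record {index} differs from source")
    return problems
```

An empty list means the two sides agree under these checks. It does not prove the translation is correct in every respect, only that no rows were dropped and no field values changed.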
In most real-world scenarios, the data transformation steps described above would be performed automatically by software tools. So if these steps sound like work that you are not prepared to perform by hand, worry not.
Still, it’s valuable for human operators to understand what their data transformation tools are doing at each step of the data transformation process, and how each action adds up to make data transformation possible.