Data is like the Force: It has a light side and a dark side, and it holds the universe together. But dark data won’t do you much good. To keep your universe running smoothly, you need light data, meaning data that is immediately accessible and actionable.
Confused yet? If so, keep reading for an explanation of the difference between light and dark data, and how to turn the latter into the former.
Dark data is data that you can’t do much with because it’s difficult to analyze or integrate with the rest of your infrastructure.
The classic dark data example is mainframe data that is stored on tape. Tape storage is attractive because it’s inexpensive; indeed, in many cases, tape is the only way that massive amounts of mainframe data can be stored in a cost-efficient way.
But tape storage has major downsides. You can’t run analytics on data that is on tape. Moving data from tape to other media takes a long time and is complicated by compatibility issues. And tape deteriorates over time, so eventually, your dark tape data disappears into the ether entirely.
More generally, any type of mainframe data – even if it is not stored on tape – could be considered dark because it is hard to apply analytics tools to mainframe environments without performing a lot of complex and time-consuming conversions and migrations.
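To give a sense of what those conversions involve, here is a minimal Python sketch that decodes one fixed-width mainframe record. It assumes the text is EBCDIC (code page 037) and the amount is a packed-decimal (COMP-3) field. The record layout, field names and sample bytes are hypothetical and chosen only for illustration; real layouts come from the COBOL copybook that describes the file.

```python
from decimal import Decimal


def decode_packed_decimal(raw: bytes, scale: int = 0) -> Decimal:
    """Decode an IBM packed-decimal (COMP-3) field.

    Each byte holds two 4-bit digits, except the last byte, whose low
    nibble is the sign (0xD means negative; 0xC and 0xF are non-negative).
    """
    digits = []
    for byte in raw[:-1]:
        digits.append(byte >> 4)
        digits.append(byte & 0x0F)
    digits.append(raw[-1] >> 4)          # final digit
    sign_nibble = raw[-1] & 0x0F         # sign lives in the last nibble
    value = int("".join(str(d) for d in digits))
    if sign_nibble == 0x0D:
        value = -value
    return Decimal(value).scaleb(-scale)  # apply the implied decimal point


def decode_record(record: bytes) -> dict:
    # Hypothetical layout: 10-byte EBCDIC name, then a 3-byte packed amount
    # with two implied decimal places.
    name = record[:10].decode("cp037").rstrip()
    amount = decode_packed_decimal(record[10:13], scale=2)
    return {"name": name, "amount": amount}


# Build a sample record the way it might look on the mainframe side.
sample = "ALICE".ljust(10).encode("cp037") + bytes([0x12, 0x34, 0x5C])
print(decode_record(sample))
```

Even this toy case needs knowledge of the encoding, the field widths and the numeric format, which is why doing it by hand across thousands of files is slow and error-prone.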
Inconsistent or low-quality data is also dark data: inconsistencies, gaps and formatting errors prevent you from making use of it.
Then there’s light data, which can be easily and quickly analyzed. It’s data that is immediately accessible to tools in Hadoop or Spark and can be used to deliver real-time analytics results. It’s stored on fast, reliable media, and it exists natively in formats that can be ingested by modern analytics tools. It’s consistent and of high quality.
Making Dark Data Light
You might think that once data becomes dark, you’ll never be able to make it light again. After all, it’s usually much easier to turn something light into something dark than the other way around.
Fortunately, however, data is not like laundry. There is no reason why you can’t take dark data and make it light again – or, even better, implement a data storage strategy that avoids dark data entirely.
How do you lighten your dark data? The answer is simple: You move it from clunky, outdated storage locations like tape to Hadoop, where it is immediately actionable. Or you set up a mainframe storage workflow that allows you to move mainframe data to Hadoop as soon as the data is created, thereby bypassing tape archives altogether.
Or, if your data is of low quality, lightening it involves fixing inconsistencies, missing information, formatting errors and other problems that lower its quality.
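As a rough illustration of what that kind of clean-up can look like, the Python sketch below normalizes whitespace, casing and date formats in a single record and reports which fields are still missing. The field names (`name`, `state`, `signup_date`) and the list of accepted date formats are assumptions made for this example, not part of any particular product.

```python
from datetime import datetime

# Date formats we are willing to accept (an assumption for this sketch).
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_record(record):
    """Fix formatting problems and list the fields that are still unusable."""
    cleaned = {
        # Collapse repeated spaces and use consistent capitalization.
        "name": " ".join((record.get("name") or "").split()).title(),
        "state": (record.get("state") or "").strip().upper(),
        "signup_date": normalize_date(record.get("signup_date") or ""),
    }
    problems = [field for field, value in cleaned.items() if not value]
    return cleaned, problems


messy = {"name": "  jane   DOE ", "state": "ny ", "signup_date": "03/29/2017"}
print(clean_record(messy))

incomplete = {"name": "Bob", "state": "", "signup_date": "sometime"}
print(clean_record(incomplete))
```

The first record comes out fully usable, while the second is returned with `state` and `signup_date` flagged so it can be repaired or set aside instead of silently polluting later analysis.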
Data Lightening with DMX-h and Trillium
If you’ve ever tried to move a tape archive to a modern analytics environment like Hadoop by hand, you know it’s neither fun nor easy. It takes a long time because tape I/O is slow and there are huge amounts of data to handle. With most tape hardware, you’re lucky to get above 120 MB/s, which is slower than commodity magnetic hard disks can sustain today. Even more challenging, the move requires specialized expertise because mainframe data and native Hadoop data are very different beasts.
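To put that throughput figure in perspective, here is a back-of-the-envelope calculation in Python. The 100 TB archive size is a hypothetical example, and the estimate assumes a single drive reading continuously at 120 MB/s with no time lost to mounting, seeking or format conversion, so real migrations will take longer.

```python
TB = 10**12  # bytes in a terabyte (decimal units)
MB = 10**6   # bytes in a megabyte (decimal units)
SECONDS_PER_DAY = 86_400


def transfer_days(total_bytes, throughput_bytes_per_sec):
    """Days needed to read total_bytes at a constant throughput."""
    seconds = total_bytes / throughput_bytes_per_sec
    return seconds / SECONDS_PER_DAY


archive_size = 100 * TB  # hypothetical archive size
print(f"{transfer_days(archive_size, 120 * MB):.1f} days")
```

Under these assumptions, reading the archive alone takes roughly 9.6 days of uninterrupted streaming.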
That’s where DMX-h, Syncsort’s data integration solution, comes in. DMX-h automatically ingests mainframe data into Hadoop, empowering you to make the tape-to-Hadoop migration by flipping a single switch. You don’t have to be a data conversion expert.
And if you want to skip the tape altogether, you can take advantage of DMX DataFunnel, which automatically transforms DB2 data from mainframe sources into data that can be ingested by Hadoop.
Last but not least, if poor data quality is what’s making your data dark, that’s a challenge Syncsort can now help solve, too. Thanks to the recent Trillium acquisition, Syncsort now offers data quality solutions in addition to data integration tools.