Stock Up, Then Light Up Your Dark Data
The data warehouse isn’t going the way of Windows XP (the unsupported operating system still powering most of the world’s ATMs). A data warehouse has value precisely because it is supported, structured, predictable and reliable – especially when it has been designed with master data management, has smoothly operating ETL and a team of folks who know how to interpret its contents.
Interpreting the meaning of “Dark Data” takes a bit of untangling. Because of the widely read UBM computer security publication Dark Reading, “Dark Data” connotes hackers and cybersecurity. At the transparency-minded Sunlight Foundation, “Dark Data” refers to data that is maintained by governments or their contractors but kept far from the light of open data portals. In computer hardware design, “Dark Silicon” refers to transistor underutilization in microprocessor design, or to a UCSD academic center that studies it.
That’s what Dark Data isn’t.
What is the Dark Data of Big Data? Jeff Reser had cosmology in mind when he wrote that “Dark matter is 84% of the universe, and we can’t see it. Dark data in a sense is everything we do and we still can’t see it.” Intel, in its “Getting Started with Hadoop” document, sees the exploitation of Big Data’s Variety in part as a challenge to address “the heterogeneity of big data – or ‘shadow’ or ‘dark data,’ such as access traces and web search histories.”
Gartner defines data as “dark” when organizations collect it with the intention of using it but are unable to do so. Some of this subterranean digital detritus isn’t detritus at all. And while Matt Aslett of 451 Research describes Dark Data as “previously ignored because of technology limitations,” there are also organizational limitations. Projects fall out of favor, lose funding, or fall prey to the latest disruptor. Think of them as going dark.
Credit: Timo Elliott of SAP
As Syncsort’s Steve Totman analogizes: “It’s like buying a fridge full of food but not eating it. It’s expensive to buy, expensive to store and then you just throw it out.”
A study by AvePoint estimated that U.S. federal agencies, for example, already hold 1.6 petabytes of Dark Data. Some of this federal data trove may be at least partly illuminated by a White House Executive Order, “Making Open and Machine Readable the New Default for Government Information” (May 8, 2013). AvePoint cites one example of this Dark Data being put to use at the Veterans Health Administration: the VHA has been able to use some of its unstructured Dark Data to help identify high-risk patients.
Timo Elliott recommends a study of Passenger Dwell Time at the Copenhagen Airport, which was accomplished using logs from Wi-Fi routers. This Dark Data can be used for facilities planning, passenger flow analysis and queue design.
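A minimal sketch of how dwell time might be derived from such logs — the log format, field names and device IDs below are invented for illustration, not taken from the Copenhagen study:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical Wi-Fi sighting log: (device_id, timestamp) pairs recorded
# each time a router sees a device. Real airport logs would be far messier
# (randomized MACs, gaps, multiple access points).
log = [
    ("aa:bb:cc:01", "2014-06-01 08:05"),
    ("aa:bb:cc:01", "2014-06-01 09:40"),
    ("aa:bb:cc:02", "2014-06-01 08:30"),
    ("aa:bb:cc:02", "2014-06-01 08:55"),
]

def dwell_minutes(records):
    """Estimate dwell time per device: last sighting minus first sighting."""
    seen = defaultdict(list)
    for device, ts in records:
        seen[device].append(datetime.strptime(ts, "%Y-%m-%d %H:%M"))
    return {d: (max(t) - min(t)).total_seconds() / 60 for d, t in seen.items()}

print(dwell_minutes(log))
# {'aa:bb:cc:01': 95.0, 'aa:bb:cc:02': 25.0}
```

Aggregating these per-device estimates by hour or by terminal zone is what turns raw router logs into the facilities-planning and queue-design insights described above.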
Here are a few steps organizations can take to crank up the lumens on their Dark Data:
- Poll domain experts within the enterprise to assess which Dark Data is worth staging into unstructured repositories
- Establish very small tiger teams (as small as one investigator) to take exploratory dives into Dark Data using contemporary analytics tools
- Identify which traditional structured data could be better informed when supplemented by Dark Data. (Low-hanging fruit can still be quite tasty.)
- Streamline structured data systems by stripping out unused elements that can be pushed to lower cost Dark Data repositories
- Brainstorm revenue-generating or cost-reducing ideas that could be fueled by Dark Data
- Find out what customers, suppliers and competitors are doing with their Dark Data
- Make friends with the Hadoop team
Unlike the food in your fridge, Dark Data won’t spoil. It will wait — perhaps quietly, cost-effectively stored in Hadoop. That is, assuming it has been rescued from dusty shelves of mag tape and other relics of yesterday’s backups and neglected data.
Is Dark Data Lurking in Legacy Backups?
Learn how Syncsort and HP Vertica Analytics capabilities can be harnessed to unlock and monetize Dark Data — and extended to Hadoop when needed.