Offload the Data Warehouse in the Age of Hadoop
A fully loaded oil tanker can take more than a mile to turn, and as long as 15 minutes to stop. In a recent webcast, Steve Totman, Director of Strategy at Syncsort, likened the challenges of a traditional data warehouse to turning around an oil tanker. Typical warehouse problems, he said, include longer-duration load jobs, ever-expanding staging areas, capacity demands and what he termed the “SQL Tuning Death Spiral.” Even routine events as minor as adding a new column have a ripple effect: an enterprise architect has to update the data model, the ETL scripts have to be updated, and QA has to run regression tests. If the new column causes a performance hit, a SQL expert is called in as well.
All this hampers agility, ties up valuable talent with maintenance activities, and adds costs.
In the 90s, the data warehouse was envisioned as an information nirvana: a “single source of truth.” By using a data warehouse, it was hoped, analysts could produce reports, build dashboards and conduct research independently without impairing production systems. Do these newly surfaced problems mean that the very idea of a data warehouse is dead in the water?
Two experts believe all is not lost for the battle-weary data warehouse. Despite the many challenges, Totman and Santosh Chitakki, VP of Products at Appfluent, seek to throw the traditional data warehouse a lifeline. Rethink the enterprise data hub in light of Big Data tools, they suggest. If, as Teradata reports, 20-40% of machine resources are consumed by data warehouse ETL operations, there are clear opportunities for optimization and pruning.
Citing several examples from large customers, Totman and Chitakki report cost savings of as much as $15 million from moving selected portions of data warehouse content and ETL overhead to Hadoop. Performance gains can be even more dramatic; ETL for a sample loan application that typically ran for 6 hours was reduced to 15 minutes, a 24-fold reduction. Even more compellingly, Syncsort customer comScore loads 1.6 trillion events into a Greenplum-based warehouse. This is warehousing on a scale not envisioned by most warehouse architects.
A Modest Proposal
“Know your data” is an old bromide in IT shops, but Big Data has given the expression new urgency. Taking the Syncsort and Appfluent offerings together as the two recommend, and stripping the concept to its barest minimum, the idea is to identify unused and rarely used data and move it to a Hadoop cluster.
To avoid throwing out the baby with the bathwater, Appfluent’s tool Visibility can be used to gain insight into how data is used, abused or unused. The latter Chitakki terms “dormant.” One of the more valuable services performed by Visibility is to uncover de facto data retention practices, which typically have a huge impact upon Big Data volume.
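To make the idea of "dormant" data concrete, the sketch below labels tables by how recently and how often they are queried. It is only an illustration and does not reflect how Visibility works internally: the usage summary, the table names, the thresholds and the function names are all hypothetical.

```python
from datetime import date

# Hypothetical usage summary (assumed, not from any real tool):
# table name -> (date of last query, number of queries in the past year)
USAGE = {
    "orders_2009": (date(2011, 3, 1), 0),
    "orders_current": (date(2014, 1, 20), 5400),
    "clickstream_raw": (date(2013, 6, 2), 3),
}


def classify(usage, today, dormant_days=365, rare_queries=10):
    """Label each table 'dormant', 'rare' or 'active'.

    The thresholds are illustrative assumptions; a real shop would
    derive them from its own retention and reporting requirements.
    """
    labels = {}
    for table, (last_query, query_count) in usage.items():
        idle_days = (today - last_query).days
        if query_count == 0 or idle_days > dormant_days:
            labels[table] = "dormant"
        elif query_count < rare_queries:
            labels[table] = "rare"
        else:
            labels[table] = "active"
    return labels


def offload_candidates(usage, today):
    """Return the tables that are not actively used, sorted by name."""
    labels = classify(usage, today)
    return sorted(t for t, label in labels.items() if label != "active")
```

Calling `offload_candidates(USAGE, date(2014, 2, 1))` would flag the never-queried and rarely queried tables as candidates for a move to the Hadoop cluster, while leaving the heavily used table in the warehouse.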
Forecast: Still More Offloading
Totman and Chitakki are not alone in their predictions. MapR CEO John Schroeder echoed their assessments in an interview:
“The year 2014 will see a majority of companies deploying Hadoop to offload ETL processing and data from enterprise data warehouses to Hadoop.”
Much of this offloading will be motivated by cost reductions, or at the least, cost containment. Totman, Chitakki and Schroeder independently offer a similar savings estimate: that enterprise hubs based on Hadoop are an order of magnitude cheaper than alternative approaches.
Perhaps more significantly for many organizations, tools like DMX-h and Visibility offer a clear migration path for IT staff. By leveraging these GUI-enabled tools, usable by non-developers when necessary, warehouse staff can kickstart new projects. Importantly, the tools make it possible to follow the Pareto principle (i.e., the 80/20 rule) by identifying the 20% of data manipulation that causes most of the growing pain.
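The 80/20 idea can be sketched in a few lines: rank ETL jobs by the resources they consume and pick the smallest set that accounts for most of the load. This is a minimal illustration under assumed inputs, not a description of what DMX-h or Visibility do; the job names, the cost figures and the `pareto_jobs` function are hypothetical.

```python
def pareto_jobs(cost_by_job, share=0.8):
    """Return the costliest jobs that together reach `share` of total cost.

    `cost_by_job` maps a job name to any cost measure, such as
    CPU-seconds per day. Jobs are returned from most to least costly.
    """
    total = sum(cost_by_job.values())
    ranked = sorted(cost_by_job.items(), key=lambda kv: kv[1], reverse=True)

    selected = []
    running = 0.0
    for job, cost in ranked:
        # Stop once the jobs already selected cover the target share.
        if running >= share * total:
            break
        selected.append(job)
        running += cost
    return selected


# Hypothetical daily CPU-seconds per ETL job (made-up numbers).
ETL_COST = {
    "load_loans": 600,
    "dedupe_events": 250,
    "daily_rollup": 80,
    "dim_refresh": 50,
    "audit_export": 20,
}

heavy_hitters = pareto_jobs(ETL_COST)
```

With these made-up figures, two of the five jobs cover at least 80% of the total cost, which is the kind of short list a warehouse team might examine first when choosing what to offload.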
Open Offload Options
While offloading to Hadoop is the focus of this conversation, it was clear from audience questions that some regulated enterprises will take a go-slow approach. To such practitioners, a destabilized data warehouse would be a big concern; mandated external reporting as well as internal operations dashboards could be disrupted. For these risk-averse scenarios, Totman explains that Syncsort DMX-h offers alternative offloading pathways in addition to the native Hadoop connector that is available with the Cloudera, MapR, Hortonworks and pure Apache Hadoop distros.
In the meantime, Hadoop clusters seem poised to gain even more traction in what has been the mainstay of structured data efforts: the data warehouse. As Totman observes, “the enterprise data warehouse isn’t going away.”
Mark Underwood writes about knowledge engineering, Big Data security and privacy.