Hadoop ETL – Offloading the Enterprise Data Warehouse
Hadoop is quickly becoming the operating system for Big Data, a platform that provides powerful services for developers and vendors to create Big Data applications. As this operating system becomes the new standard for Big Data, new and exciting use cases for Hadoop continue to emerge. So I’m happy to start our new series, “Most Popular Use Cases for Hadoop,” where we will look at some of these use cases, their motivations, and their potential benefits. And why not start with what we know best: Hadoop ETL?
Over ten years ago, ETL tools promised a simple approach: extract data from multiple sources, transform it into valuable information, and load it into a common repository – the enterprise data warehouse – where business users could leverage it for competitive insights. Over time, however, ETL tools were overwhelmed by the accelerating volume, velocity and variety of data. As IT organizations realized their ETL tools couldn’t scale, they tried to keep up by adding more hardware, increasing database capacity, and pushing the transformations – the “T” in ETL – into the data warehouse. Unfortunately, the data warehouse is not the best place to do this type of work. (You can see why in one of our all-time favorite blogs, ETL vs. ELT.) Nevertheless, SQL became one of the most popular ETL tools. But that is soon to change. The truth is, after struggling for years to implement and scale conventional ETL tools, many organizations are now looking at Hadoop to collect, process and distribute more data than ever before, at a disruptive cost.
So where to begin? Well, organizations can start by identifying some of the heaviest data transformations occurring in their data warehouse environments. Typically, 20% of the transformations can consume up to 80% of database capacity. Then, they can shift those transformations out of the data warehouse and into Hadoop. This approach allows them to realize significant benefits very quickly, including shortened ETL batch windows, faster database user queries, and significant operational savings in the form of spare database capacity.
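To make the idea concrete, here is a minimal sketch of the kind of heavy transformation that often gets pushed into the warehouse – a large GROUP BY aggregation – expressed instead in the map/reduce style Hadoop uses. The records and field names are hypothetical, and plain Python stands in for an actual Hadoop job; it only illustrates the split into a map phase and a reduce phase.

```python
from collections import defaultdict

# Hypothetical sales records that would normally be aggregated
# inside the data warehouse with a heavy GROUP BY query.
records = [
    {"region": "EMEA", "amount": 120.0},
    {"region": "AMER", "amount": 75.5},
    {"region": "EMEA", "amount": 30.0},
    {"region": "APAC", "amount": 200.0},
]

def map_phase(record):
    # Emit (key, value) pairs, as a Hadoop mapper would.
    yield record["region"], record["amount"]

def reduce_phase(pairs):
    # Sum the values for each key, as a Hadoop reducer would.
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

pairs = [kv for r in records for kv in map_phase(r)]
totals = reduce_phase(pairs)
print(totals)  # {'EMEA': 150.0, 'AMER': 75.5, 'APAC': 200.0}
```

Because the map and reduce phases shard naturally across a cluster, this same aggregation scales out on commodity hardware instead of consuming warehouse capacity – which is the whole point of offloading the “T.”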
However, when you do that, keep in mind that Hadoop is not a complete ETL solution. Failing to recognize the gaps between the operating-level services Hadoop provides and the functionality users expect from enterprise ETL can create frustration and hamper the benefits of Hadoop. And that’s exactly where DMX-h ETL Edition comes into play, making sure you have everything you need to unleash the power of Hadoop and deploy a smarter approach to ETL. After all, we’ve been helping organizations offload the “T” from the data warehouse for years!
So, what challenges are you facing when deploying ETL in Hadoop? What is prompting you to offload the “T” from your data warehouse?