Parallel ETL Tools Are Dead
They just don’t know it yet.
The critical flaw in parallel ETL tools is that the data is almost never local to the processing nodes. Every time a large job runs, the data must first be read from the source, split N ways, and delivered to the individual nodes. Worse, if the partition key of the source doesn’t match the partition key of the target, data has to be constantly exchanged among the nodes. In essence, parallel ETL treats the network as if it were a physical I/O subsystem. The network, which is usually the slowest part of the process, becomes the weakest link in the performance chain.
The result is that the CPUs and memory on the local nodes are rarely fully utilized. Basically, you have a system that under-utilizes the local hardware and over-utilizes the network. It is not surprising, then, that an efficient SMP ETL tool often outperforms bigger, more expensive parallel ETL tools. But given the investment companies have made in these tools, it’s difficult to justify a rip-and-replace strategy.
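To make the partition-key problem concrete, here is a minimal sketch in Python. It is a toy model, not how any particular ETL product works: the modulo partitioner, the four-node cluster, and the `order_id` / `customer_id` columns are all assumptions made up for illustration. It simply counts how many rows would have to cross the network when the target is partitioned on a different key than the source.

```python
def node_for(key, num_nodes):
    """Toy partitioner (an assumption): assign a row to a node by key modulo node count."""
    return key % num_nodes


def rows_moved(rows, source_key, target_key, num_nodes):
    """Count rows whose source node differs from their target node.

    Every such row has to travel over the network during repartitioning.
    """
    moved = 0
    for row in rows:
        src = node_for(row[source_key], num_nodes)
        dst = node_for(row[target_key], num_nodes)
        if src != dst:
            moved += 1
    return moved


# Hypothetical data set: 1,000 orders, each with a made-up customer id.
rows = [{"order_id": i, "customer_id": i * 3} for i in range(1000)]

# Source and target partitioned on the same key: nothing moves.
same_key = rows_moved(rows, "order_id", "order_id", num_nodes=4)

# Source partitioned on order_id, target on customer_id: rows must be exchanged.
different_key = rows_moved(rows, "order_id", "customer_id", num_nodes=4)

print(f"same key: {same_key} rows moved, different key: {different_key} rows moved")
```

Even in this tiny model, a mismatch in keys turns a purely local job into one that ships a large share of its rows between nodes, which is exactly the traffic that ends up saturating the network.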
With the arrival of Hadoop, all of this has changed. Hadoop provides low-cost storage as well as the potential for scalable ETL via the MapReduce paradigm. For the first time in a parallel environment, the processing is sent to the data: Hadoop tries to schedule each task on a node that already holds the data it needs, a huge performance advantage. Now, ETL designers can take advantage of the scale of Hadoop without having to pay the penalty for excess network traffic. Because of this, as Hadoop matures, it could become the ETL platform of choice for large organizations.
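For readers who haven’t seen the paradigm, the following is a rough, single-process sketch of the map, shuffle, and reduce steps in plain Python. It does not use Hadoop’s actual API; the function names, the `blocks` structure, and the sales-by-region data are all hypothetical, and each inner list merely stands in for a block of data stored on one node.

```python
from collections import defaultdict


def map_phase(block):
    """Map step: emit (key, value) pairs from one locally stored block."""
    for region, amount in block:
        yield region, amount


def reduce_phase(region, amounts):
    """Reduce step: aggregate all values that share a key."""
    return region, sum(amounts)


def run_job(blocks):
    """Run map over each block, group by key (the shuffle), then reduce."""
    grouped = defaultdict(list)
    for block in blocks:  # in a real cluster, each block is mapped on the node holding it
        for key, value in map_phase(block):
            grouped[key].append(value)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: two blocks of (region, sale amount) records.
blocks = [
    [("east", 10), ("west", 5)],
    [("east", 7), ("north", 2)],
]
totals = run_job(blocks)
print(totals)
```

The point of the sketch is the shape of the job: the map step reads only the block in front of it, so that work can run wherever the block lives, and only the grouped intermediate results need to be brought together for the reduce step.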
Remember those ETL tools that were designed to be efficient in an SMP environment? They’re back! Now that the data is local, the ability of the tool to fully utilize hardware resources becomes even more important. A tight, efficient engine provides Hadoop with the ability to scale both horizontally and vertically – more work with less!
These simpler tools will also help with the adoption of Hadoop as an ETL platform. Currently, there is a huge disconnect between the ETL designer and the Java programmer. Most ETL designers don’t know Java and most Java developers don’t know data structures, so even if the processing is efficient, the coding isn’t. Organizations will need twice the people to solve half the problem. However, as these SMP ETL tools fully integrate with Hadoop, Hadoop will inherit their visual design paradigm, making development much simpler and more data driven. Hadoop will then offer the ideal combination of performance, scalability, and ease of use. At that point, why would customers pay for a heavyweight, complex, parallel ETL tool? I’m betting they won’t.