Dave Nahmias

Parallel ETL Tools Are Dead

August 29, 2012

They just don’t know it yet.

The critical flaw in parallel ETL tools is that the data is almost never local to the processing nodes.  This means that every time a large job is run, the data first has to be read from the source, split N ways, and then delivered to the individual nodes.  Worse, if the partition key of the source doesn’t match the partition key of the target, data has to be constantly exchanged among the nodes.  In essence, parallel ETL treats the network as if it were a physical I/O subsystem.  The network, which is typically the slowest part of the process, becomes the weakest link in the performance chain.
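
To make the partition-key problem concrete, here is a minimal Python sketch.  It is not taken from any particular ETL tool: the four-node cluster, the modulo placement rule, and the `customer_id` and `region_id` columns are all assumptions made up for illustration.  It simply counts how many rows would have to cross the network when the target is partitioned on a different key than the source.

```python
# Hypothetical setup: 4 nodes, rows placed on a node by "key modulo node count".
NUM_NODES = 4


def node_for(key, num_nodes=NUM_NODES):
    """Assign an integer key to a node with a simple modulo scheme."""
    return key % num_nodes


def count_network_moves(rows, source_key, target_key):
    """Count rows whose target node differs from the node they start on."""
    moves = 0
    for row in rows:
        if node_for(row[source_key]) != node_for(row[target_key]):
            moves += 1
    return moves


# Illustrative data: 100 customers spread over 4 regions.
rows = [{"customer_id": i, "region_id": i // 25} for i in range(100)]

# Source and target partitioned on the same key: nothing has to move.
same_key_moves = count_network_moves(rows, "customer_id", "customer_id")

# Source partitioned by customer, target by region: most rows must move.
different_key_moves = count_network_moves(rows, "customer_id", "region_id")

print(same_key_moves, different_key_moves)
```

In this toy setup, matching keys move nothing, while mismatched keys send roughly three quarters of the rows to a different node.  That is what you would expect with four nodes, since a row only has about a one-in-four chance of already sitting where it belongs.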

The result is that the CPUs and memory on the local nodes are rarely fully utilized.  Basically you have a system that under-utilizes the local hardware and over-utilizes the network.  It is not surprising, then, that an efficient SMP ETL tool often outperforms bigger, more expensive parallel ETL tools.  But, given the investment companies have made in these tools, it’s difficult to justify a rip-and-replace strategy.

With the arrival of Hadoop, all of this has changed.  Hadoop provides low-cost storage as well as the potential for scalable ETL via the MapReduce paradigm.  For the first time in a parallel environment, the data starts out local to the nodes: Hadoop tries to schedule the processing on the nodes that already hold the data, a huge performance advantage.  Now, ETL designers can take advantage of the scale of Hadoop without having to pay the penalty for excess network traffic.  Because of this, as Hadoop matures, it could become the ETL platform of choice for large organizations.

Remember those ETL tools that were designed to be efficient in an SMP environment? They’re back! Now that the data is local, the ability of the tool to fully utilize hardware resources becomes even more important.  A tight, efficient engine provides Hadoop with the ability to scale both horizontally and vertically – more work with less!

These simpler tools will also help with the adoption of Hadoop as an ETL platform. Currently, there is a huge disconnect between the ETL designer and the Java programmer.  Most ETL designers don’t know Java and most Java developers don’t know data structures, so even if the processing is efficient, the coding isn’t.  Organizations will need twice the people to solve half the problem.  However, as these SMP ETL tools fully integrate with Hadoop, Hadoop will inherit their visual design paradigm, making development much simpler and more data driven.  Hadoop will then provide the ideal combination of performance, scalability, and ease of use.  At that point, why would customers pay for a heavyweight, complex, parallel ETL tool?  I’m betting they won’t.


Automobiles and ETL – what could they possibly have in common? What if I told you it was fuel injection? Remember the muscle cars of yesteryear? They got their power through bigger and bigger engines.  Back then, no one would have believed that a four-cylinder car could be faster and more efficient. That was until fuel injection.  Fuel injection focused on the most inefficient part of the process: gas consumption.  By improving this one aspect, engines could be smaller, faster and cheaper. And once an on-board computer could automatically tune the process, fuel injection became the norm.  At the time it required a radical re-orientation, but today, nearly every vehicle uses fuel injection.

I believe we are at a similar inflection point in ETL.  What is the most inefficient part of the ETL process? Ordering records!  Joins, aggregations, rankings, database loads, etc. all depend on ordering to complete their processing.  You may be surprised to learn that upwards of 75 percent of all ETL processing time is spent ordering.  If you need to join two heterogeneous sources (for example a file and a table), the records have to be lined up.  If you need to group records in order to create subtotals, the records have to be ordered.  If you need to create rankings, you guessed it: the records have to be ordered.
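
Here is a small Python sketch of why joins and subtotals lean on ordering.  The record layouts, the `cust` key, and the `merge_join` helper are hypothetical, and the join assumes each key appears only once on the table side.  It is meant to show the idea, not how any particular ETL engine implements it.

```python
from itertools import groupby
from operator import itemgetter


def merge_join(left, right, key):
    """Join two lists of dicts that are ALREADY sorted by `key`.

    Assumes keys on the right side are unique (lookup-table style).
    """
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][key], right[j][key]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            result.append({**left[i], **right[j]})
            i += 1  # stay on the right row; the next left row may share the key
    return result


# Illustrative "file" and "table" sources, both arriving unordered.
file_rows = [
    {"cust": 3, "amount": 20},
    {"cust": 1, "amount": 10},
    {"cust": 3, "amount": 5},
    {"cust": 2, "amount": 7},
]
table_rows = [
    {"cust": 2, "name": "Bo"},
    {"cust": 1, "name": "Al"},
    {"cust": 3, "name": "Cy"},
]

by_cust = itemgetter("cust")

# Both inputs must be ordered on the join key before they can be lined up.
joined = merge_join(sorted(file_rows, key=by_cust),
                    sorted(table_rows, key=by_cust), "cust")

# groupby only groups adjacent records, so subtotals also rely on the ordering.
subtotals = {cust: sum(row["amount"] for row in group)
             for cust, group in groupby(joined, key=by_cust)}
```

Once the records are in order, both the join and the subtotals are a single pass over the data.  The sorts are where the real work happens.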

Current thinking is to attack the need to order data with the equivalent of a muscle car.  More hardware, more parallelism and more cost.  Most ETL tools don’t innovate in this area and simply use parallelism to address the problem.  This looks good on a slide with perfect streams of data being operated on concurrently.  Unfortunately, reality is not so perfect.

First, these streams are only perfect if the source is partitioned exactly like the target (which is almost never the case).  To line up the data, either a sort has to be performed on each node or each node has to exchange records with every other node in order to get them lined up properly for the target.  Pick your poison: either a lot of simple sort tasks feeding a simple merge, or lots of network traffic and resources to pass records around.  And what about all those additional barely-used CPUs that are required to address what is essentially an I/O problem?
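
The first option, a sort on each node feeding a merge, can be sketched in a few lines of Python.  The three "nodes" and their values are made up, and everything runs in one process here.  In a real cluster each sorted run would also have to travel over the network to whatever performs the merge.

```python
import heapq

# Hypothetical unsorted partitions, one list per node.
node_partitions = [
    [42, 7, 19],
    [3, 88, 23],
    [56, 1, 30],
]

# Step 1: each node sorts its own partition independently.
sorted_runs = [sorted(partition) for partition in node_partitions]

# Step 2: a single k-way merge combines the sorted runs into one ordered stream.
merged = list(heapq.merge(*sorted_runs))

print(merged)
```

`heapq.merge` only works because every input run is already sorted, which is why the per-node sorts cannot be skipped.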

A second approach is to load the data into a database.  Unfortunately, a database is not very good at ordering records that aren’t indexed.  So each time a task is pushed down to the database, a temporary table has to be created to hold the data, the data has to be loaded into that temp table, and then an index has to be created (by the way, creating the index also requires ordering).

This approach is able to keep up with demand as long as there is excess capacity in the database and the ETL process is nightly.  However, as volume, currency and usage grow, ETL starts colliding with the business user. Add to that the cost of constantly expanding database hardware and software licenses to try to keep up with the Big Data explosion, and many organizations are beginning to consider a more “radical” approach.

This approach, like fuel injection, is surprisingly simple. Rather than continue to build bigger and bigger engines, why not just improve the record-ordering process?  By injecting intelligence into the most inefficient part of the process, ETL can be performed faster with less hardware.  Suddenly, complex joins can be done in seconds without the need for staging tables and indexes.  ETL can once again be performed on commodity hardware with minimal impact on the environment.

Intelligent ordering is also finding rapid acceptance with the emergence of Hadoop.  The MapReduce paradigm requires ordering in a number of its steps, and the distributed, file-based architecture requires a small footprint. Because of this, Hadoop implementations are finding efficient ordering to be even more important.
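
To see where ordering shows up in MapReduce, here is a toy, single-process Python imitation of the three phases.  It does not use the Hadoop API.  The function names and the region/amount records are assumptions for illustration only.  The point is the middle step: the intermediate pairs are sorted by key so that each key's values reach the reducer together.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(records):
    """Emit (key, value) pairs; here the key is simply the region."""
    for region, amount in records:
        yield region, amount


def shuffle_phase(pairs):
    """Order the intermediate pairs by key so equal keys become adjacent."""
    return sorted(pairs, key=itemgetter(0))


def reduce_phase(sorted_pairs):
    """Sum the values for each run of identical keys."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield key, sum(value for _, value in group)


# Illustrative input records: (region, amount).
records = [("east", 5), ("west", 3), ("east", 2), ("north", 4), ("west", 1)]

totals = dict(reduce_phase(shuffle_phase(map_phase(records))))

print(totals)
```

Without the sort in the middle, the reducer would see the same key in several separate runs and produce partial totals.  The faster that ordering step runs, the faster the whole job finishes.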

So the next time you consider an ETL tool, make sure it is designed on a second-generation engine that understands the importance of being ordered.
