The Importance of Being Ordered
Automobiles and ETL – what could they possibly have in common? What if I told you it was fuel injection. Remember the muscle cars of yesteryear? They got their power through bigger and bigger engines. Back then, no one would have believed that a four cylinder car could be faster and more efficient. That was until fuel injection. Fuel injection focused on the most inefficient part of the process – gas consumption. By improving this one aspect, engines could be smaller, faster and cheaper. And once an on-board computer could automatically tune the process, fuel injection became the norm. At the time it required a radical re-orientation, but today, nearly every vehicle uses fuel injection.
I believe we are at a similar inflection point in ETL. What is the most inefficient part of the ETL process? Ordering records! Joins, aggregations, rankings, database loads, etc. all depend on ordering to complete their processing. You may be surprised to learn that upwards of 75 percent of all ETL processing is spent ordering. If you need to join two heterogeneous sources (for example a file and a table), the records have to be lined up. If you need to group records in order to create subtotals, the records have to be ordered. If you need to create rankings, you guessed it; the records have to be ordered.
Current thinking is to attack the need to order data with the equivalent of a muscle car. More hardware, more parallelism and more cost. Most ETL tools don’t innovate in this area and simply use parallelism to address the problem. This looks good on a slide with perfect streams of data being operated on concurrently. Unfortunately, reality is not so perfect.
First, these streams are only perfect if the source is partitioned exactly like the target (which is almost never the case). To line up the data, either a sort has to be performed on each node or each node has to exchange records with every other node in order to get them lined up properly for the target. Pick your poison, either a lot of simple sort tasks feeding a simple merge or lots of network traffic and resources to pass records around. And what about all those additional barely-used CPUs that are required to address what is essentially an I/O problem?
A second approach is to load the data into a database. Unfortunately, a database is not very good at ordering records that aren’t indexed. So each time a task is pushed down to the database, a temporary table has to be created to hold the data, the data has to be loaded into that temp table, and then an index has to be created (by the way, creating the index requires ordering also).
This approach is able to keep up with demand as long as there is excess capacity in the database and the ETL process is nightly. However, as volume, currency and usage grows, ETL starts colliding with the business user. Add to that the cost of constantly expanding database hardware and software licenses to try to keep up with the Big Data explosion and many organizations are beginning to consider a more “radical” approach.
This approach, like fuel injection, is surprisingly simple. Rather than continue to build bigger and bigger engines – how about just improve the record ordering process. By injecting intelligence into the most inefficient part of the process, ETL can be performed faster with less hardware. Suddenly, complex joins can be done in seconds without the need for staging tables and indexes. ETL can once again be performed on commodity hardware with minimal impact on the environment.
Intelligent ordering is also finding rapid acceptance with the emergence of Hadoop. The Map-Reduce paradigm requires ordering in a number of its steps and the distributed, file-base architecture requires a small footprint. Because of this, Hadoop implementations are finding efficient ordering to be even more important.
So the next time you consider an ETL tool, make sure it is designed on a second-generation engine that understands the importance of being ordered.