ETL vs. ELT: A Tale of Staging and Set Based Processing
Last week I was reading a discussion thread that somehow resurrected an old debate about ETL vs. ELT. In other words, Should the “T” live in the database or on the “ETL” tool? Several arguments were made in favor of pushing the “T” down to the database. Later that week, I got a similar question from the audience during a TDWI Webinar with Claudia Imhoff:
Databases are better at set based operations than record-at-a-time ETL tools. Therefore, Why not use the database for ETL?
To better explain this, I called my colleagues Dave Nahmias and Mike Wilkes to the rescue, so I would like to share some of their thoughts in this post.
Yes, SQL is based on set theory. Its original goal was to limit the result set, allowing users to retrieve specific information from the database. “Prior to relational databases, users were forced to read through volumes of green bar reports to get to what interested them. SQL allowed users to select only the information they wanted to see,” says Dave.
Traditional RDBMs implemented indexes as means to accelerate this process by quickly locating and combining large lists of records. Once the database arrives to the desired answer, it retrieves the data and presents it to the user. According to Dave, it is exactly in this last part of the process where the problem resides. “Since the goal of a query was usually to narrow the results, the amount of data returned was usually small, so the inefficiencies of data retrieval were hidden. For example, I just ran a test than joined a 24 million row table with a 6 million row table in Oracle. Producing a count of records from the join took about 30 seconds, since this primarily involved processing the indexes. When I added columns to the select and piped the output to NULL (to avoid the cost of writing to disk), the process took over 5 minutes.” Dave explains that unlike user queries, a typical ETL process usually results in millions of records delivered to another table and very often most of the columns are part of the output.
I guess by now you’re starting to get the picture… Who said staging data was best practices? Today it seems more like a really expensive workaround.
This is when Dave brings up the good news. “With DMExpress, there is no need for a staging area, since it can join sources at near – and sometimes better than – indexed speeds. This eliminates the need for terabytes of storage, and removes nightly loading, truncating and indexing from the process.” Now, that’s clever! Eliminate staging to reduce otherwise increasing database and maintenance costs.
So maybe it’s time to address the second part of the question, the part that talks about record-at-a-time ETL tools. Yes, most ETL tools process one record-at-a-time. “Luckily we are not them” states Mike with his usual confidence. “We use direct I/O and read data in large, sequential reads. Then we pull it into memory buffers. We thread to keep those buffers flushed. We are operating on the data in memory assets.” Now, that is fast!