Our “making-a-better-ETL-for-Hadoop” DMX-h Spring ’13 announcement today….why it’s cool!
Today, we here at Syncsort announced two new Hadoop products – DMX-h Sort and DMX-h ETL. I don’t want to repeat the announcement here, I’d rather talk about how I think this is unique. When we started planning these new releases, we set our goal to: “make the hard, easy; and the impossible, possible.”
We consistently hear from our customers and partners, that ETL (some call it data refinement, data collection & preparation, etc.) is the #1 use case for Hadoop. But if you think about typical ETL, there are some problems that are very hard with Hadoop and the distributed nature of the data on Hadoop. Join for instance, particularly when both sides of the Join are large, is one of them.
This was a challenge that we embraced. We not only wanted to help our customers and partners solve Hadoop issues, we wanted to make it easy to do and significantly increase performance at the same time. Part of our announcement today is “smarter productivity” with use case accelerators for common use cases, such as Join. These are pre-built job templates complete with documentation. You simply fill in the metadata, and then you can execute. Or you can try it out with the samples we include.
Another common hard (impossible?) problem is Change Data Capture – the process of taking two data sets (the current data set and the previous data set), and identify the changed data. The changed records need to be flagged as New, Updated, or Deleted.
This can be very difficult in MapReduce when both data sets are large. You could literally write hundreds of lines of Java code, or use DMX-h ETL. The approach is very straight forward using our GUI.
So, why is this hard in MapReduce? Well it’s not a difficult concept, but there are difficulties in implementation. Think about what needs to be done on the Map side before you can identify the changed records on the Reduce side. The Map side needs to bring the data together in each Mapper, while keeping track of whether the data is from the current or the previous data set. It then needs to be sorted before sending to the Reducers. And don’t forget, you’re going to 1) need to have all of the same records (based on a key like customer ID) to the same reducer, and 2) do not hard code the number of reducers because you’re going to want this to be dynamic based on a number of factors, particularly data volume at execution time.
On the Reduce side, now that you have all of the like records based on some key (customer ID) in the same Reducer, you need to re-split the data from the current and previous data sets and then perform a full outer join. Records that appear in the current version but not the previous version are inserts, and those that appear in the previous version but not the current one, are deletes. For records that appear in both, the non-primary-key fields are compared and any cases where they are not identical are updates. Each changed record needs to be flagged, with an I, D, or U.
We actually have about a dozen use case accelerators right now and we will continue to add more as we work from our users.
Information about how we natively integrate into Hadoop MapReduce is available in the announcement along with some initial performance & scalability results. We will write some blog posts over the coming weeks about the performance benefits using DMX-h.
These are not only hard problems to solve in Hadoop, but we’ve been told, impossible to solve, with other approaches (i.e., other tools) on Hadoop. Making the hard, easy; and the impossible, possible…now that’s cool!