I heard a collective gasp from the audience, and I breathed a small sigh of relief, when I demoed live two graphical MapReduce jobs that replaced 988 lines of SQL. The run time on the Hadoop cluster I was connected to was a mere 12 minutes, versus the more than 2 hours it took in production on a very large Oracle instance. I was definitely surprised at how many people came up to me after the meetup and commented on how nice it was to see something running live. I guess they see a lot of PowerPoints at meetups, but I think it's important to show the real deal.
The use case was from a recent proof of concept (POC) I had completed with Syncsort's DMX-h product, processing 800 million rows of pharmaceutical/retail data through very CPU-intensive ETL steps – two joins and one group-by aggregation – yet another perfect candidate for processing with DMX-h and Hadoop. The current production job runs in Oracle and is mission critical to the business, driving improved and more targeted customer marketing campaigns. What makes it an even better fit for the combination is the month-over-month growth of the data and the sheer number of similar jobs that bog down the Enterprise Data Warehouse (EDW).
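To give a feel for the shape of this workload, here is a minimal, purely illustrative sketch of a reduce-side join followed by a group-by aggregation – the MapReduce pattern underlying jobs like this one. All table names, fields, and data below are made up; the actual POC jobs were built in DMX-h's graphical designer, not hand-coded.

```python
from collections import defaultdict

# Toy input tables (illustrative only, not from the actual POC).
sales = [(1, 10, 5.0), (1, 11, 7.5), (2, 10, 3.0)]  # (product_id, store_id, amount)
products = [(1, "analgesics"), (2, "vitamins")]      # (product_id, category)

def mr_join(left, right):
    """Reduce-side equi-join on the first field of each record.

    Map phase: route each record to its join key.
    Shuffle: group records from both inputs by key.
    Reduce phase: emit the cross-product of matching rows per key.
    """
    shuffled = defaultdict(lambda: ([], []))
    for key, *rest in left:
        shuffled[key][0].append(tuple(rest))
    for key, *rest in right:
        shuffled[key][1].append(tuple(rest))
    for key, (lrows, rrows) in shuffled.items():
        for l in lrows:
            for r in rrows:
                yield (key, *l, *r)

# Join: sales joined to products on product_id.
# (A second join, e.g. to a stores table, would chain the same pattern.)
joined = list(mr_join(sales, products))
# joined rows look like: (product_id, store_id, amount, category)

# Group-by aggregation: total revenue per category.
totals = defaultdict(float)
for _product, _store, amount, category in joined:
    totals[category] += amount

print(dict(totals))  # {'analgesics': 12.5, 'vitamins': 3.0}
```

At production scale the shuffle is what the Hadoop cluster parallelizes across nodes, which is why the same logic that strains a single RDBMS scales out cleanly.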
In effect, Hadoop and DMX-h were used to demonstrate how easy it is to offload such jobs to MapReduce. OK, I know you're probably waiting for the gory details of run times before versus after the offload to Hadoop. Oracle run time: 2 hours and 40 minutes, after many months of tuning the code and the underlying infrastructure, compared to 12 minutes in DMX-h MapReduce with no tuning – completely out of the box! I was thrilled, and moreover proud of what I had just proved. And the ease of it all: I was able to develop the simple point-and-click MapReduce job in DMX-h in a day and run it in 12 minutes!
I did NOT want to stop there – I wanted to keep going, so I decided to run the same job with 2X and 5X the amount of data to test the much-talked-about scalability of Hadoop and DMX-h. The results were spectacular: 21 minutes and 51 minutes respectively (as you will see if you click the link to the YouTube video at the end of this blog). Now, did we test 2X the volume in Oracle? Yes indeed, and it ran for 13 hours after tuning – further proof that offloading resource-intensive processes from your EDW (in this case Oracle, but it could be any RDBMS) can pay huge dividends. Think about all the EDW resources you will free up as a result – and all the data you can now process with no effort!
I left out a huge part of the story: the load to HDFS! This is huge because it took only 4 minutes to load the 1X production data volume into HDFS, and only 7 minutes for 2X. Guess how long it took to load the 3 Oracle source tables with 2X the volume… 20 hours, thanks to the highly indexed table structures and the extremely busy data warehouse.
So now you know exactly why there was a collective gasp from the audience.
In conclusion, I would suggest a few things:
- DO NOT be conservative when planning a data warehouse offload to Hadoop – DMX-h will make it seamless.
- DO NOT worry about the difficulty of converting data warehouse SQL to DMX-h.
- DO NOT worry about Java expertise or code generation – my ETL product background was all I needed!
To view my presentation on YouTube, click here.