The Latest Buzz on Hadoop: Load your Hive even Faster
A couple of weeks ago, we rather quietly released a new version of our DMX and DMX-h products, release 7.14. We are starting to get the word out about what is in that new release. One exciting new feature Nikhil Kumar recently blogged about is our Data Transformation Language (DTL). Another is the ability to read and write/load Hive tables. While there are already tools out there that can read from and write to Hive, I want to explain why ours is better…and by better I mean faster.
With our 7.14 release, DMX and DMX-h can now read and write/load Hive tables. Directly from our GUI, a user can select Hive as a source or target. When DMX loads a Hive table, the product also updates HCatalog with the associated metadata.
When a developer is building a job/program/script to load a Hive table, there is no way to know what the size of the cluster, the network bandwidth, the I/O speed, the size of the data, and so on will be at execution time. So how can the loading of data be truly optimized without a lot of manual, deeply technical hand tuning…for each instance of a load (which could run hundreds of times a day)? I mentioned that we load Hive faster…so here are the details. Our loading of Hive tables has been optimized so that we spawn multiple threads to load in parallel. This gives us higher throughput and a shorter elapsed time than, for instance, the Hive command line load. As a further optimization, we dynamically determine the number of threads to spawn at execution time based on the size of the cluster, the network bandwidth, the I/O speed, and the size of the data being loaded. And while DMX will optimize the number of spawned threads, the user can always control it through an execution-time environment variable.
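To make the idea concrete, here is a minimal Python sketch of the general pattern: pick a thread count at execution time, let an environment variable override it, then load in parallel. This is not DMX's actual logic. The variable name `LOAD_THREADS`, the functions `choose_thread_count` and `load_in_parallel`, and the sizing rule (one thread per 256 GiB, capped at two per node) are all assumptions made up for illustration, and the sketch ignores network bandwidth and I/O speed entirely.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def choose_thread_count(data_bytes, node_count, env=None,
                        bytes_per_thread=256 * 1024**3, max_per_node=2):
    """Pick a worker count: user override first, else a simple heuristic.

    The heuristic is illustrative only: one thread per `bytes_per_thread`
    of input, capped at `max_per_node` threads per cluster node.
    """
    env = os.environ if env is None else env
    override = env.get("LOAD_THREADS")  # hypothetical variable name
    if override is not None:
        return max(1, int(override))
    wanted = -(-data_bytes // bytes_per_thread)  # ceiling division
    return max(1, min(wanted, node_count * max_per_node))


def load_in_parallel(partitions, load_one, threads):
    """Run `load_one` on every partition using a pool of `threads` workers."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map returns results in the same order as the inputs
        return list(pool.map(load_one, partitions))
```

A caller would compute `threads = choose_thread_count(total_bytes, nodes)` once per job run and pass it to `load_in_parallel` along with whatever function actually writes one slice of the data. The point of the design is that the developer never hard-codes the thread count when building the job, yet an operator can still pin it at execution time.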
Show me the numbers! We have done some initial tests vs. the Hive load command. We loaded 1 TB, 2 TB, and 4.5 TB into a relatively small Hadoop cluster (6 nodes). Below are the results.
In all of the cases we tested in our environment, DMX loaded the data at least twice as fast. For the smaller volumes, only 6 threads were spawned. For the largest data size, the product dynamically determined that 8 threads would be optimal for this environment. We anticipate even better results for larger data volumes and larger clusters.
If you’re already a Syncsort DMX or DMX-h user, try it out today by upgrading to this latest release.
Stay tuned for more information about the 7.14 release.