Moving Data into Hadoop Faster

Last month, I wrote a post about the Hadoop performance benchmark results we obtained for TeraSort and Aggregation. In advance of Hadoop Summit 2012, we completed a new set of performance tests on loading the Hadoop Distributed File System (HDFS).

We ran the tests on the same 12-node cluster (10 data nodes). This time, however, we switched the cluster over to Hortonworks Data Platform 1.0.7 (based on Apache Hadoop 1.0.2).

So far, we have completed tests only on “smaller” data sets (10GB, 100GB, 200GB, and 500GB), all with uncompressed data.

Because the Hadoop HDFS put command is single threaded, we wanted a way to parallelize the load. Using DMExpress, we partitioned the data before invoking the put command. Through a series of tests we found that, for our ETL server and Hadoop cluster, 20 partitions worked best. We used simple round-robin partitioning of the data.
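
DMExpress does the partitioning itself, but the parallel-load idea is easy to approximate with the standard Hadoop FileSystem API. Below is a minimal Java sketch, assuming the data has already been round-robin partitioned into 20 local files; the paths, file names, and the ParallelHdfsLoad class are hypothetical illustrations, not part of our test setup. The speedup comes from giving each partition its own concurrent HDFS write pipeline instead of one serial put.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Minimal sketch: upload pre-partitioned local files to HDFS in parallel.
    public class ParallelHdfsLoad {
        public static void main(String[] args) throws Exception {
            final int partitions = 20; // the count that worked best on our cluster
            ExecutorService pool = Executors.newFixedThreadPool(partitions);

            for (int i = 1; i <= partitions; i++) {
                // Hypothetical local partition and HDFS target paths.
                final Path src = new Path("file:///data/customer_file_" + i);
                final Path dst = new Path("/load/customer_file_" + i);
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            // Each copy runs in its own thread, so the 20
                            // uploads overlap rather than running one after
                            // another like a single-threaded put.
                            Configuration conf = new Configuration(); // reads core-site.xml
                            FileSystem fs = FileSystem.get(conf);
                            fs.copyFromLocalFile(src, dst);
                        } catch (Exception e) {
                            throw new RuntimeException(e);
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
        }
    }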

The results were very interesting and encouraging.

  • 10GB: DMExpress loaded in 71.4 seconds vs. 157.1 seconds for Hadoop put
  • 100GB: DMExpress 720.3 seconds vs. Hadoop put 1,571.6 seconds
  • 200GB: DMExpress 1,553.4 seconds vs. Hadoop put 3,173.3 seconds
  • 500GB: DMExpress 4,809.9 seconds vs. Hadoop put 8,526.6 seconds

As you can see, DMExpress loaded HDFS roughly 1.8x to 2.2x faster than the out-of-the-box Hadoop put command across all the data sizes we have tested to date.

You might be wondering how to use the partitioned data now that it’s “chunked” on HDFS. It’s actually quite simple: Hadoop (and DMExpress) can process wildcard input files from HDFS. For example, if the input file to load is named ‘customer_file’, the resulting partitioned files on HDFS will be named ‘customer_file_1’, ‘customer_file_2’, and so on. Use customer_file* as the input to your MapReduce job, whether you write your MR tasks in Java, Pig, or DMExpress.
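
Here is a minimal Java sketch of that glob-input setup. The /load paths and the GlobInputJob class are hypothetical, and the mapper, reducer, and key/value configuration are omitted; the point is only the wildcard input path.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class GlobInputJob {
        public static void main(String[] args) throws Exception {
            Job job = new Job(); // Hadoop 1.x-era API
            job.setJarByClass(GlobInputJob.class);
            // Hadoop expands glob patterns when resolving input paths, so one
            // wildcard covers customer_file_1, customer_file_2, and the rest.
            FileInputFormat.addInputPath(job, new Path("/load/customer_file*"));
            FileOutputFormat.setOutputPath(job, new Path("/load/output"));
            // Mapper and reducer setup omitted for brevity.
            job.waitForCompletion(true);
        }
    }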

We are going to “pump up the volume” on the data sizes, and we also plan to vary the number of partitions, block sizes, and other settings in order to develop best practices for our customers. Stay tuned for more results.

If you have questions or want to learn more, leave a comment below.
