
Moving Data into Hadoop Faster

Last month, I wrote a post about Hadoop performance benchmark results that we completed on TeraSort and Aggregation.  In advance of Hadoop Summit 2012, we completed new performance tests on loading the Hadoop Distributed File System (HDFS).

We ran the tests on the same 12-node (10 name nodes) cluster. However, this time we switched the cluster over to the Hortonworks Data Platform 1.0.7 (Apache 1.0.2).

To this point, we have only completed tests on “smaller” data sets – 10GB, 100GB, 200GB and 500GB – all with uncompressed data.

Because the Hadoop HDFS put command is single-threaded, we wanted a way to parallelize the load. Using DMExpress, we partitioned the data prior to invoking the put command. Through various tests, we found that for our ETL server and Hadoop cluster, 20 partitions worked best, using simple round-robin partitioning of the data.
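For readers who want to experiment with the general idea, here is a minimal sketch in Python, assuming a newline-delimited local file and the hadoop client on the PATH. The file and directory names are placeholders, and DMExpress does the partitioning natively and far more efficiently; this only illustrates a round-robin split followed by parallel puts:

    import subprocess
    from itertools import cycle

    SOURCE = "customer_file"    # local input file (name borrowed from the post)
    HDFS_DIR = "/data"          # hypothetical target directory on HDFS
    PARTITIONS = 20             # the partition count that worked best in our tests

    # 1. Round-robin the input records into PARTITIONS local chunk files.
    chunks = [open("%s_%d" % (SOURCE, i + 1), "w") for i in range(PARTITIONS)]
    with open(SOURCE) as src:
        for line, out in zip(src, cycle(chunks)):
            out.write(line)
    for out in chunks:
        out.close()

    # 2. Launch one `hadoop fs -put` per chunk so the copies run in parallel,
    #    then wait for all of them to finish.
    procs = [
        subprocess.Popen(["hadoop", "fs", "-put", "%s_%d" % (SOURCE, i + 1), HDFS_DIR])
        for i in range(PARTITIONS)
    ]
    for p in procs:
        p.wait()

Each chunk gets its own put process, which recovers the parallelism that a single put cannot provide.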

The results were very interesting and encouraging.

  • 10GB: DMExpress loaded in 71.4 seconds vs. 157.1 seconds for Hadoop put (2.2x faster)
  • 100GB: DMExpress 720.3 seconds vs. Hadoop put 1,571.6 seconds (2.2x faster)
  • 200GB: DMExpress 1,553.4 seconds vs. Hadoop put 3,173.3 seconds (2.0x faster)
  • 500GB: DMExpress 4,809.9 seconds vs. Hadoop put 8,526.6 seconds (1.8x faster)

As you can see, DMExpress loaded HDFS roughly 1.8x to 2.2x faster than the out-of-the-box Hadoop put command across all the data sizes we have tested to date.

You might be wondering how to use the partitioned data now that it’s “chunked” on HDFS. It’s actually quite simple: Hadoop (and DMExpress) can process wildcard input paths from HDFS. For example, if the input file is named ‘customer_file’, the resulting partitioned files on HDFS will be named ‘customer_file_1’, ‘customer_file_2’, etc. Simply use customer_file* as the input to your MapReduce job – whether you write your MR tasks in Java, Pig, or DMExpress.
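As one illustration, a Hadoop Streaming job can consume the glob directly; the snippet below launches it from Python, where the streaming jar path and the mapper.py/reducer.py scripts are placeholders rather than part of our setup:

    import subprocess

    # Placeholder locations: the streaming jar path varies by distribution, and
    # mapper.py / reducer.py stand in for whatever job logic you run.
    STREAMING_JAR = "/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar"

    # The glob /data/customer_file_* matches every partitioned chunk on HDFS,
    # so the job reads all of the pieces as one logical input.
    subprocess.check_call([
        "hadoop", "jar", STREAMING_JAR,
        "-input", "/data/customer_file_*",
        "-output", "/data/customer_file_out",
        "-mapper", "mapper.py",
        "-reducer", "reducer.py",
        "-file", "mapper.py",
        "-file", "reducer.py",
    ])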

We are going to “pump up the volume” on the data sizes, and we also plan to look at varying the number of partitions, block sizes, etc. in order to provide best practices for our customers.  Stay tuned for more results.

If you have questions or want to learn more, leave a comment below.

5 comments
  • Sean — June 29, 2012 at 9:26 am

    10 name nodes? do you mean 10 data nodes?

  • Keith Kohl — July 1, 2012 at 10:48 am

    Hi Sean. Thanks for reading the blog. Yes, I meant data nodes, not name nodes.

  • Raghavendra — February 6, 2013 at 3:49 am

    Hi,

    Can you please send me some basic commands for sending files to and from HDFS, and for partitioning the input file in Hadoop? My email is rmdraghu6@gmail.com

    Thanks,
    Raghavendra

  • Keith Kohl — February 6, 2013 at 11:31 am

    Hi Raghavendra-

    Take a look at these shell commands for Hadoop. http://hadoop.apache.org/docs/r0.17.2/hdfs_shell.html

    But better yet, our product DMExpress does this through a graphical interface and can pre-sort, partition, and compress the data before loading it into HDFS.

    –Keith

  • Skip — April 20, 2016 at 2:08 pm

    I read your post and wished I’d written it
