Yet Another Hadoop Benchmark – Part I
Is this blog about yet another TeraSort benchmark?
Benchmarking is an integral part of our life at Syncsort, since the products we develop are highly focused on performance and scalability. That said, I often find published benchmark results to be carefully crafted showcases. This blog is not about a single benchmark data point over thousands of nodes. It is about a set of Hadoop benchmarks focused on TeraSort and the most common Extract-Transform-Load (ETL) use cases. While it’s great to show results on thousands of nodes, I believe it is also important to show benchmark results on a more common, real-world configuration. Based on what we have seen with our customers and on the data published on the Apache Hadoop Wiki “Powered By Hadoop” page, we used a 10-node cluster for this particular test of our new products.
Now, let’s look at DMX-h Sort Edition first. DMX-h Sort delivers an alternative sort implementation for the MapReduce sort phase: during the MapReduce data flow, the data is sorted using Syncsort’s sort algorithms instead of the native sort. Integration of DMX-h Sort is seamless; you can configure either a particular job or all jobs running on the cluster to use DMX-h Sort whenever possible. We ran the TeraSort benchmark with DMX-h Sort.
In today’s blog, I will focus on the TeraSort results. The tests were run with CDH 4.2, simply because this distribution is the first generally available release to include Syncsort’s contribution to the Apache MapReduce project, MAPREDUCE-2454.
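To sketch what per-job versus cluster-wide configuration looks like, the fragment below uses standard Hadoop configuration mechanics. Note that the property name and plugin class here are hypothetical placeholders for illustration only; the actual identifiers are defined by the DMX-h / MAPREDUCE-2454 integration and documented with the product.

```xml
<!-- mapred-site.xml: enable the alternative sort for all jobs on the cluster.
     NOTE: the property name and class below are hypothetical placeholders,
     not the actual DMX-h identifiers. -->
<property>
  <name>mapreduce.job.sort.plugin.class</name>
  <value>com.example.dmx.DmxSortPlugin</value>
</property>
```

A single job could instead pass the same property on its command line via Hadoop’s standard `-Dproperty=value` mechanism, leaving the cluster-wide default untouched.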
– (10 + 1 + 1) nodes with 12 cores – Intel Xeon X5670, 2.93 GHz
– Memory: 96GB per node
– Disk drives: 12 x 3 TB, 7200 RPM; I/O speed: 110 MB/sec write, 140 MB/sec read
– HDFS block size = 256 MB
– MapReduce version 1
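To put the benchmark sizes in context, here is some back-of-the-envelope arithmetic from the spec list above. The aggregate bandwidth figures assume all 12 drives in a node can be driven in parallel, so they are theoretical per-node ceilings, not measured numbers.

```python
# Back-of-envelope aggregates derived from the hardware specs above.
drives_per_node = 12
drive_capacity_tb = 3
write_mb_per_sec = 110   # per-drive sequential write
read_mb_per_sec = 140    # per-drive sequential read

raw_tb_per_node = drives_per_node * drive_capacity_tb  # raw disk per node
agg_write = drives_per_node * write_mb_per_sec         # theoretical ceiling
agg_read = drives_per_node * read_mb_per_sec           # theoretical ceiling

print(raw_tb_per_node)  # 36 TB raw disk per node
print(agg_write)        # 1320 MB/sec aggregate write per node
print(agg_read)         # 1680 MB/sec aggregate read per node
```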
The chart below displays the TeraSort benchmark results: elapsed time (wall-clock time) to run TeraSort with the native sort versus with DMX-h Sort. Map output compression is not enabled for this set of tests.
As you can see, the percentage improvement in elapsed time increases as the input data size grows. With DMX-h Sort, the performance gain over the native sort scales with data size, from 35% for 0.5 TB to over 55% for about 2 TB of data, i.e. more than 2x faster.
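The arithmetic linking an elapsed-time reduction to a speedup factor is worth spelling out, since “55% less time” can sound smaller than “2x faster”:

```python
# Converting a fractional elapsed-time reduction into a speedup factor.
# A 55% reduction means the job finishes in 45% of the original time,
# i.e. a 1 / 0.45 ≈ 2.2x speedup.

def speedup_from_reduction(reduction: float) -> float:
    """Speedup factor implied by a fractional elapsed-time reduction."""
    return 1.0 / (1.0 - reduction)

print(round(speedup_from_reduction(0.35), 2))  # 35% reduction -> 1.54x
print(round(speedup_from_reduction(0.55), 2))  # 55% reduction -> 2.22x
```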
Let’s look at the amount of data processed per unit of time per node, i.e. megabytes of data processed per second per node (MB/sec/node). As the data size changes, throughput with DMX-h Sort remains constant, whereas the amount of data processed per second per node by the native sort drops.
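The MB/sec/node metric is simple to compute, and a constant value means elapsed time grows linearly with data size. The sketch below shows the calculation; the elapsed times used here are hypothetical placeholders for illustration, not the measured results from the chart.

```python
# MB/sec/node: throughput normalized by cluster size.

def mb_per_sec_per_node(data_tb: float, elapsed_sec: float, nodes: int) -> float:
    """Megabytes of input processed per second per node."""
    data_mb = data_tb * 1024 * 1024  # TB -> MB (binary units)
    return data_mb / elapsed_sec / nodes

# Hypothetical elapsed times (placeholders, not measured results):
t1 = mb_per_sec_per_node(0.5, 1800, 10)  # 0.5 TB in 30 minutes, 10 nodes
t2 = mb_per_sec_per_node(2.0, 7200, 10)  # 2 TB in 2 hours, 10 nodes

# Equal values -> elapsed time scaled linearly with data size.
print(round(t1, 1), round(t2, 1))
```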
What does this mean? Basically, you can process more data per node by using DMX-h Sort, without adding more nodes to the cluster, since DMX-h helps you scale within each node. Processing more data per second per node implies cost savings in both CAPEX and OPEX; you don’t have to grow your cluster every time there is a significant jump in data volume. If your cluster is deployed in the cloud, this means less usage and more cost savings.
In this particular test, we set a baseline with the standard TeraSort benchmark to demonstrate the benefits of using DMX-h Sort versus the native sort. In Part II of this blog, we will focus on typical ETL use cases: web log aggregation and Change Data Capture (CDC). Stay tuned…