When we announced our DMExpress Hadoop offering, we shared a set of results from benchmark testing that had been completed. Testing has continued since, and I wanted to dedicate this post to sharing some of those results.
We did a series of tests that distill down to:
- TeraSort benchmark (if you’re not familiar with this benchmark, it is worthwhile to search it online)
- Aggregation based on TPC-H generated data (aggregated on order id for line item data)
We varied two things in the tests:
- Compression in the shuffle step: no compression, GZIP
- Data volume: We ramped up to 4TB on the TeraSort and 600GB on the Aggregation
The tests were done on a 10-node cluster running CDH3u2 (Apache 0.20.2).
The results were very interesting, but not surprising. For TeraSort:
- No compression:
- While DMExpress was faster for smaller data volumes (under 1TB), the elapsed times were still small – 15.12 minutes for native sort vs. 11.93 minutes with DMExpress for 500GB
- When you pump up the data volumes, DMExpress really outperformed the native sort – 240.48 minutes for native sort vs. 144.18 minutes with DMExpress for 4TB. That’s a 40% improvement and nearly 2x faster. That was consistent for 1TB and 2TB, as well
- GZIP compression, the results were consistently 2x or more faster:
- 20.82 minutes for native vs. 8.98 minutes with DMExpress for 500GB
- 223.82 minutes vs. 84.72 minutes with DMExpress for 4TB, more than 2x faster!
For the Aggregation, we wrote the same aggregation logic in Java, Pig and DMExpress (a key benefit with DMExpress is using a GUI rather than coding, but this post is focused on performance). The compression results were consistent across the board with the non-compression results, so I will just give you the results using GZIP:
- 150GB of data
- Java: 2.4 minutes
- Pig: 2.92 minutes
- DMExpress: 1.18 minutes
- Java: 7.89 minutes
- Pig: 11.15 minutes
- DMExpress: 4.07 minutes
DMExpress is nearly 2x faster vs. Java, and consistently more than 2x faster than Pig.
What’s that mean for you? It means that you can do more with less nodes, which has implications for the CapEx and OpEx associated with it. Simply stated, you can process more data with the cluster you already have available. If you happen to be running on a public cloud, faster processing times also mean less usage time.
If you have any questions or want to learn more, please feel free to leave a comment.