Hadoop MapReduce: to Sort or Not to Sort
Tuesday Jan 22nd was a critical milestone for us at Syncsort as our main contribution to the Apache Hadoop project was committed. This contribution, patch MAPREDUCE-2454, introduced a new feature to the Hadoop MapReduce framework to allow alternative implementations of the Sort phase. This work started more than a year ago and Syncsort’s Technology Architect Asokan worked closely with the Apache open source community on design iterations, code reviews and commits. We sincerely thank Apache Hadoop community and MapReduce project committers for their collaboration and support throughout this work and congratulate them on the release of Hadoop-2.0.3-alpha.
What is the big deal about Sort? Sort is fundamental to the MapReduce framework, the data is sorted between the Map and Reduce phases (see below). Syncsort’s contribution allows native Hadoop sort to be replaced by an alternative sort implementation, for both Map and Reduce sides, i.e. it makes Sort phase pluggable.
Opening up the Sort phase to alternative implementations will facilitate new use cases and data flows in the MapReduce framework. Let’s look at some of these use cases:
Optimized sort implementations. Performance of sort-intensive data flows and computation of aggregate functions requiring sort, like MEDIAN, will improve significantly when an optimized sort implementation is used. Such implementations can take advantage of hardware architectures, operating system and data characteristics. Improving the performance of sort within the MapReduce framework is already listed as one of the Hadoop Research projects, see http://wiki.apache.org/hadoop/HadoopResearchProjects under ‘Map reduce performance enhancements’, and sort benchmarks are often used for evaluating Hadoop.
Hash-based aggregations. Many aggregate functions where the output of the aggregation is small enough to fit in memory, e.g. COUNT, AVERAGE, MIN/MAX, can be implemented as hash-based aggregation that does not require sort (see MAPREDUCE-3247). A special sort implementation can support this by eliminating the sort altogether. Hash-based aggregations will provide significant performance benefit for applications such as log analysis and queries on large data volumes.
Ability to run a job with a subset of data. Many applications such as data sampling require processing a subset of the data, e.g. first N matches/limit N queries (see MAPREDUCE-1928). In Hadoop MapReduce, all Mappers need to finish before a Reducer can output any data. A special sort implementation using the patch can avoid the sort altogether so that the data can come to a single Reducer as soon as a few Mappers complete. The Reducer will stop after N records are processed. This will prevent launching a large number of Mappers and will drastically reduce the amount of wasted work, benefiting applications like Hive.
Optimized full joins. Critical data warehouse processes such as change data capture require a full join. Basic Hadoop MapReduce framework supports full joins in the Reducer. In certain cases where both sides of the join are very large data sets, Java implementation of a full join may easily turn into a memory hog. The patch will allow resource efficient implementations for handling large joins with performance benefits.
As my colleague Jorge Lopez’ blog post highlights, Big Data skills gap is a key challenge, technical skills around Hadoop, MapReduce and Big Data solutions are scarce and expensive. Involvement from development communities and software vendors will be critical for increased adoption of Hadoop as a data management platform. We, at Syncsort, are excited to be part of the community broadening the Hadoop platform, and increasing business value and ROI for enterprise Big Data initiatives.
For more information, see Syncsort’s Hadoop ETL solutions