Data infrastructure optimization software
Data integration and quality software
Data availability and security software
Cloud solutions

Hadoop MapReduce: to Sort or Not to Sort

Tuesday Jan 22nd was a critical milestone for us at Syncsort as our main contribution to the Apache Hadoop project was committed. This contribution, patch MAPREDUCE-2454, introduced a new feature to the Hadoop MapReduce framework to allow alternative implementations of the Sort phase. This work started more than a year ago and Syncsort’s Technology Architect Asokan worked closely with the Apache open source community on design iterations, code reviews and commits. We sincerely thank Apache Hadoop community and MapReduce project committers for their collaboration and support throughout this work and congratulate them on the release of Hadoop-2.0.3-alpha.

What is the big deal about Sort? Sort is fundamental to the MapReduce framework, the data is sorted between the Map and Reduce phases (see below). Syncsort’s contribution allows native Hadoop sort to be replaced by an alternative sort implementation, for both Map and Reduce sides, i.e. it makes Sort phase pluggable.

Sort in MapReduce

Opening up the Sort phase to alternative implementations will facilitate new use cases and data flows in the MapReduce framework. Let’s look at some of these use cases:

Optimized sort implementations. Performance of sort-intensive data flows and computation of aggregate functions requiring sort, like MEDIAN, will improve significantly when an optimized sort implementation is used. Such implementations can take advantage of hardware architectures, operating system and data characteristics. Improving the performance of sort within the MapReduce framework is already listed as one of the Hadoop Research projects, see under ‘Map reduce performance enhancements’, and sort benchmarks are often used for evaluating Hadoop.

Hash-based aggregations. Many aggregate functions where the output of the aggregation is small enough to fit in memory, e.g. COUNT, AVERAGE, MIN/MAX, can be implemented as hash-based aggregation that does not require sort (see MAPREDUCE-3247). A special sort implementation can support this by eliminating the sort altogether. Hash-based aggregations will provide significant performance benefit for applications such as log analysis and queries on large data volumes.

Ultimate Checklist for Hadoop ETL / Hadoop MapReduce

Ability to run a job with a subset of data. Many applications such as data sampling require processing a subset of the data, e.g. first N matches/limit N queries (see MAPREDUCE-1928). In Hadoop MapReduce, all Mappers need to finish before a Reducer can output any data. A special sort implementation using the patch can avoid the sort altogether so that the data can come to a single Reducer as soon as a few Mappers complete. The Reducer will stop after N records are processed. This will prevent launching a large number of Mappers and will drastically reduce the amount of wasted work, benefiting applications like Hive.

Optimized full joins. Critical data warehouse processes such as change data capture require a full join. Basic Hadoop MapReduce framework supports full joins in the Reducer. In certain cases where both sides of the join are very large data sets, Java implementation of a full join may easily turn into a memory hog. The patch will allow resource efficient implementations for handling large joins with performance benefits.

As my colleague Jorge Lopez’ blog post highlights, Big Data skills gap is a key challenge, technical skills around Hadoop, MapReduce and Big Data solutions are scarce and expensive. Involvement from development communities and software vendors will be critical for increased adoption of Hadoop as a data management platform. We, at Syncsort, are excited to be part of the community broadening the Hadoop platform, and increasing business value and ROI for enterprise Big Data initiatives.

For more information, see Syncsort’s Hadoop ETL solutions

  • Razeel — March 5, 2013 at 2:01 am

    Thank You.. 🙂

  • […] Hadoop MapReduce: to Sort or Not to Sort by Tendu Yogurtcu. […]

  • Sean — March 21, 2013 at 2:49 pm


  • Rajesh Nakkana — July 4, 2013 at 11:01 am

    If I am transforming data. For example 1000 input records get transformed into 1000 output records and stored in a hbase. I don’t need any sorting for this case. Can I just skip the reducer and store the output of the mapper into hbase.

    • Tendu Yogurtcu
      Tendu Yogurtcu — July 5, 2013 at 3:08 am
      In reply to: Rajesh Nakkana

      Hi Rajesh,
      Assuming that you are just loading data to HBase with some simple transformation (filter and/or reformat), you don’t need any Reducer. In this (Map only) case, the Hadoop MapReduce framework does not do any sorting.

  • Caetano Sauer — November 25, 2013 at 8:33 am

    Thank you for implementing this and patching it into the main Hadoop distribution. I was planning to do this for a project and you guys just saved us like 50% overhead! Sorting and shuffling in Hadoop can indeed be optimized in many ways.

  • bodla dharani kumar — February 14, 2014 at 2:41 am

    hi to all,
    Good Morning,
    I had a set of 22documents in the form of text and loaded in hdfs,when running a map/reduce funtion from command line of hdfs ,it took 4mins 31 secs for streaming the 22 text files.How do increase the map/reduce process as fast as possible so that these text files should complete the process by 5-10 seconds.
    What changes I need to do on ambari hadoop.
    Allocated 2GB of data for Yarn,and 400GB for HDFS
    default virtual memory for a job map-task = 341 MB
    default virtual memory for a job reduce-task = 683 MB
    MAP side sort buffer memory = 136 MB
    And when running a job ,Hbase error with Region server goes down,Hive metastore status service check timed out.

    Thanks & regards,

  • Lynell — April 20, 2016 at 8:52 am

    This info is the cat’s paajsam!

Leave a Comment

Related Posts