Tendu Yogurtcu

Tuesday, Jan 22nd, was a critical milestone for us at Syncsort: our main contribution to the Apache Hadoop project was committed. This contribution, patch MAPREDUCE-2454, introduced a new feature to the Hadoop MapReduce framework to allow alternative implementations of the Sort phase. This work started more than a year ago, and Syncsort’s Technology Architect Asokan worked closely with the Apache open source community on design iterations, code reviews and commits. We sincerely thank the Apache Hadoop community and the MapReduce project committers for their collaboration and support throughout this work, and we congratulate them on the release of Hadoop-2.0.3-alpha.

What is the big deal about Sort? Sort is fundamental to the MapReduce framework: the data is sorted between the Map and Reduce phases (see below). Syncsort’s contribution allows the native Hadoop sort to be replaced by an alternative sort implementation on both the Map and Reduce sides, i.e. it makes the Sort phase pluggable.

[Figure: MapReduce]
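To make the idea of a pluggable Sort phase concrete, here is a minimal single-process sketch. It is not the Hadoop API and not the code in the patch: `run_job`, `sort_fn`, `key_only_sort`, `word_map` and `count_reduce` are hypothetical names chosen for illustration. The point is only that the step between Map and Reduce becomes a parameter that can be swapped without touching the map or reduce logic.

```python
from itertools import groupby
from operator import itemgetter


def run_job(records, map_fn, reduce_fn, sort_fn=sorted):
    """Toy MapReduce. `sort_fn` stands in for the pluggable Sort phase:
    it receives the map output and returns it ordered by key."""
    map_output = [pair for record in records for pair in map_fn(record)]
    ordered = sort_fn(map_output)
    # Reduce sees one call per run of equal keys in the ordered output.
    return [reduce_fn(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]


def key_only_sort(pairs):
    """Hypothetical alternative sort: compares keys only, ignoring values."""
    return sorted(pairs, key=itemgetter(0))


def word_map(line):
    return [(word, 1) for word in line.split()]


def count_reduce(key, values):
    return (key, sum(values))


lines = ["big data", "big sort"]
default_result = run_job(lines, word_map, count_reduce)
plugged_result = run_job(lines, word_map, count_reduce, sort_fn=key_only_sort)
```

Both calls produce the same word counts; only the sort implementation differs.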

Opening up the Sort phase to alternative implementations will facilitate new use cases and data flows in the MapReduce framework. Let’s look at some of these use cases:

Optimized sort implementations. Performance of sort-intensive data flows and computation of aggregate functions requiring sort, like MEDIAN, will improve significantly when an optimized sort implementation is used. Such implementations can take advantage of hardware architectures, operating systems and data characteristics. Improving the performance of sort within the MapReduce framework is already listed as one of the Hadoop Research projects (see http://wiki.apache.org/hadoop/HadoopResearchProjects under ‘Map reduce performance enhancements’), and sort benchmarks are often used for evaluating Hadoop.

Hash-based aggregations. Many aggregate functions where the output of the aggregation is small enough to fit in memory, e.g. COUNT, AVERAGE, MIN/MAX, can be implemented as a hash-based aggregation that does not require sort (see MAPREDUCE-3247). A special sort implementation can support this by eliminating the sort altogether. Hash-based aggregations will provide significant performance benefits for applications such as log analysis and queries on large data volumes.
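The difference between the two approaches can be sketched in a few lines of Python. This is an illustration of the general technique only, not Hadoop code; the function names and the sample log records are made up for the example.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def sort_based_count(records):
    """COUNT per key the sort-then-reduce way: order all (key, value)
    pairs by key, then scan each run of equal keys."""
    ordered = sorted(records, key=itemgetter(0))
    return {key: sum(1 for _ in group)
            for key, group in groupby(ordered, key=itemgetter(0))}


def hash_based_count(records):
    """COUNT per key with a hash table. No sort is needed, provided the
    table of distinct keys fits in memory."""
    counts = defaultdict(int)
    for key, _value in records:
        counts[key] += 1
    return dict(counts)


# Hypothetical log records: (HTTP status, bytes sent).
log_records = [(200, 512), (404, 0), (200, 1024), (500, 0), (200, 256)]
```

Both functions return the same counts for `log_records`, but the hash-based version makes a single pass over the input and never orders it, which is where the savings come from when the number of distinct keys is small.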

Ability to run a job with a subset of data. Many applications such as data sampling require processing a subset of the data, e.g. first N matches/limit N queries (see MAPREDUCE-1928). In Hadoop MapReduce, all Mappers need to finish before a Reducer can output any data. A special sort implementation using the patch can avoid the sort altogether so that the data can come to a single Reducer as soon as a few Mappers complete. The Reducer will stop after N records are processed. This will prevent launching a large number of Mappers and will drastically reduce the amount of wasted work, benefiting applications like Hive.
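A rough way to picture the first-N case is with lazy streams. The sketch below is a single-process analogy, not Hadoop code: `mapper`, `first_n_matches` and the sample splits are hypothetical. Because nothing is sorted, the consumer can stop pulling as soon as it has N records, and splits that were never needed are never read.

```python
from itertools import chain, islice


def mapper(split, predicate, started):
    """Lazily emit matching records from one input split.
    `started` records which splits were actually opened."""
    started.append(split["name"])
    for record in split["records"]:
        if predicate(record):
            yield record


def first_n_matches(splits, predicate, n):
    """Return the first n matching records and the names of the splits read."""
    started = []
    unsorted_stream = chain.from_iterable(
        mapper(split, predicate, started) for split in splits)
    # islice stops pulling from the stream once n records have arrived.
    return list(islice(unsorted_stream, n)), started


# Hypothetical input splits.
splits = [
    {"name": "A", "records": [1, 8, 3]},
    {"name": "B", "records": [9, 12]},
    {"name": "C", "records": [20]},
]
matches, splits_read = first_n_matches(splits, lambda x: x > 5, 2)
```

Here `matches` holds the first two values greater than 5, and split "C" is never opened.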

Optimized full joins. Critical data warehouse processes such as change data capture require a full join. The basic Hadoop MapReduce framework supports full joins in the Reducer. In certain cases where both sides of the join are very large data sets, a Java implementation of a full join may easily turn into a memory hog. The patch will allow resource-efficient implementations for handling large joins with performance benefits.
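One well-known way to keep a full join from holding both sides in memory is a sort-merge join over inputs that are already ordered by key. The sketch below shows that general technique in Python; it is not the implementation enabled by the patch, and the function name and sample snapshots are assumptions for illustration.

```python
from itertools import groupby
from operator import itemgetter


def full_outer_merge_join(left, right):
    """Full outer join of two (key, value) iterables, each already sorted
    by key. Only the rows for the current key are held in memory."""
    left_groups = groupby(left, key=itemgetter(0))
    right_groups = groupby(right, key=itemgetter(0))
    lgrp = next(left_groups, None)
    rgrp = next(right_groups, None)
    while lgrp is not None or rgrp is not None:
        if rgrp is None or (lgrp is not None and lgrp[0] < rgrp[0]):
            # Key exists only on the left side.
            for _, value in lgrp[1]:
                yield (lgrp[0], value, None)
            lgrp = next(left_groups, None)
        elif lgrp is None or rgrp[0] < lgrp[0]:
            # Key exists only on the right side.
            for _, value in rgrp[1]:
                yield (rgrp[0], None, value)
            rgrp = next(right_groups, None)
        else:
            # Key exists on both sides: pair up the rows for this key.
            right_values = [value for _, value in rgrp[1]]
            for _, left_value in lgrp[1]:
                for right_value in right_values:
                    yield (lgrp[0], left_value, right_value)
            lgrp = next(left_groups, None)
            rgrp = next(right_groups, None)


# Hypothetical change-data-capture input: yesterday's and today's snapshots.
old_snapshot = [(1, "alice"), (2, "bob"), (4, "dave")]
new_snapshot = [(2, "bob"), (3, "carol"), (4, "david")]
changes = list(full_outer_merge_join(old_snapshot, new_snapshot))
```

In the result, a row with `None` on the right was deleted, a row with `None` on the left was inserted, and a row with two different values was updated.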

As my colleague Jorge Lopez’s blog post highlights, the Big Data skills gap is a key challenge: technical skills around Hadoop, MapReduce and Big Data solutions are scarce and expensive. Involvement from development communities and software vendors will be critical for increased adoption of Hadoop as a data management platform. We at Syncsort are excited to be part of the community broadening the Hadoop platform and increasing business value and ROI for enterprise Big Data initiatives.

Stay tuned for our next blog post, where we will talk about how Syncsort’s per-node scalability complements Hadoop’s horizontal scalability for Big Data integration… In the meantime, we would like to hear from you about your data integration experience on Hadoop!


Last week’s Hadoop Summit brought together thought leaders from the Big Data ecosystem around a comprehensive agenda. Hosted by Hortonworks and Yahoo, the event was extremely well organized and it was powerful to see more than 2,100 attendees and 50 sponsors come together, a strong validation of the growing interest in collaborating to help better define the next-generation data platform.

The overarching message for Hadoop Summit 2012 was about bringing communities together to establish a robust Big Data ecosystem for making Hadoop “enterprise ready.” The definition of community is no longer limited to open source; it is inclusive of vendors and end users. In fact, there were several sessions that specifically highlighted the need for robust data platform services and open APIs to enable vendors to integrate with open source software.

In his keynote, Hortonworks VP of Strategy Shaun Connolly mentioned that the Big Data market is estimated at $100 billion in a recent Bank of America Merrill Lynch report, with about $14 billion of that sized for Hadoop. Furthermore, more than half of the world’s data is predicted to be touched or processed by Apache Hadoop by the end of 2015.

Not surprisingly, a major inhibitor for Hadoop adoption continues to be the inability to use existing IT skills. Enterprise adoption will require companies to leverage the investments that they have already made, and IT organizations to be comfortable and confident with the solutions. In this context, Syncsort has a lot to offer with a proven track record across thousands of data integration customers for its high performance, ease of use and low TCO. DMExpress is a great complement to the Big Data solutions stack.

Geoffrey Moore, best-selling author of Crossing the Chasm, gave the keynote on the second day of the Hadoop Summit. He highlighted that consumer IT is redefining the user experience and creating a disruption, with enterprise IT on hold and consumer IT on fire. He described the Big Data business opportunity for companies that can “respect the demands of the enterprise and are fully committed to the user experience.” Companies with a vision to optimize business solutions instead of optimizing technologies will be poised to seize a big share of the Big Data market.

As my colleague Keith Kohl pointed out in his blog post on Hadoop Summit, it was a great week for Syncsort. We are very excited to be part of the Big Data technology adoption curve and what promises to be a game-changing decade!


Last week, I attended two days of sessions at GigaOM’s Structure:Data Conference in New York City, where over 700 attendees came together to discuss the business and industry-transformative nature of Big Data, and the latest technologies and approaches to best manage it all.

What struck me this year is that the conversation has evolved. Big Data is no longer seen as an infrastructure-only issue; there is now a realization that the Big Data stack requires contributions at every level, from the bottom layer of the infrastructure up through the top application layer.

The following key themes emerged from the onsite discussions and will be the focus as the community continues to develop the Big Data stack:

1) It’s all about high performance computing and speeding up analytics as data volumes grow exponentially. The pain points for unstructured versus structured data are different. While unstructured data requires better visualization of the data, structured data requires more cleansing, making filtering and grouping much more critical. One of the speakers referenced a quote from Clay Shirky: “Information overload is not the problem. It’s filter failure.”

2) The line between personal and business behavior is blurring as analytics moves out of the IT realm and into the hands of business users. As a result, there is an expectation that data will be delivered in forms that are easier to consume, such as through visualization and collaboration capabilities.

3) Real-time decision making through predictive analytics and machine learning is becoming essential with sensor data, digital exhaust and the need to gain ‘insight’ into consumer behavior.

As such, there’s a realization that the Big Data market is fragmented, and there is plenty of opportunity to contribute to building the Big Data stack. For example, software packages and tools need to be built on top of Hadoop to increase enterprise adoption. Currently most of the available enterprise software is proprietary. Offering applications layered on top of Hadoop will spur the Big Data market, leading to more open source contributions and additional opportunities for startups.

Syncsort has a lot to offer in the areas of performance, data integration, and processing, all critical components of the Big Data stack. We can deliver and run ETL over Hadoop without requiring a brand new development team and skill set. One of the speakers suggested that businesses should consider adopting Hadoop only if they are willing to dedicate a separate team. Syncsort’s offering eliminates this requirement for the enterprise. We can also efficiently move data in and out of Hadoop, which, as John Webster points out in his CNET post, continues to be an issue.

To reach the holy grail of Big Data management, the focus needs to be on building a top-to-bottom Big Data stack, which will require different segments of the market to come together.
