A Bird’s-Eye View of the Elephant: Hadoop Summit 2013
Hortonworks and Yahoo hosted yet another great Hadoop Summit last week!
The most interesting thing about this Summit was seeing almost every distribution vendor developing its own Big Data solution stack, a potential sign of fragmentation, and yet collaborating around YARN (Yet Another Resource Negotiator), which I found very exciting and promising. YARN is viewed as the next-generation Hadoop, transforming Hadoop into a true platform where multiple kinds of data flows, e.g. real-time processing and batch processing, can run. There is still a lot to do to make YARN enterprise-ready, with the main concerns being stability and security. The community acknowledges this by focusing on security and life cycle management with the Knox and Falcon projects, respectively. Easy deployment, upgrades, and metrics are critical to the enterprise, and we see Ambari addressing these requirements.
Storm-YARN was a big highlight of the Summit, as it allows real-time processing to be co-located with batch processing while sharing data between Storm and MapReduce. Shaun Connolly of Hortonworks described this as applications running “IN” Hadoop versus “ON” Hadoop. Yahoo recently open sourced Storm-YARN under the Apache 2.0 license, and YARN is in production at Yahoo with 30K nodes and 400K jobs per day! YES, it is real!
There was also tremendous interest in performance improvements. The session on Tez, a generic application framework for running complex data processing flows (directed acyclic graphs) on top of YARN, was so full that I was told 30 people would have to leave the room before they could let me in! So I decided to move on to “Putting the Sting in Hive,” a detailed session on Hive performance improvements.
Multiple sessions covered optimized columnar file formats such as ORC File and Parquet, again with a focus on performance and compression benefits.
YARN is a huge step towards transforming Hadoop from a solution into a data processing platform. Hadoop has moved to the next stage with Apache Hadoop 2/YARN, and the market is at the point of crossing the chasm, as Rob Bearden of Hortonworks puts it. However, skill sets and security remain the biggest gaps in enterprise adoption of Hadoop, as Gartner’s Merv Adrian highlighted during his keynote.
We at Syncsort are excited to have our high-performance ETL product, DMX-h ETL, generally available for both Hadoop 1 and Hadoop 2. DMX-h ETL delivers on enterprise requirements such as security, performance, and ease of use, with a user-friendly GUI for developing MapReduce jobs that run “IN” Hadoop. The product ships with a set of use case accelerators for common ETL flows to help organizations get started and quickly become productive with Hadoop. Take DMX-h for a Test Drive, available with the Cloudera and Hortonworks Hadoop distributions, and see what we mean. Don’t forget to share your experience with us!
Overall, it was great to see the community’s clear vision and commitment to making Hadoop 2 the enterprise data platform.
Hadoop Summit may be over, yet a lot is happening at Syncsort this week! DMX-h ETL, our Hadoop ETL product, is now GA! We welcomed our new CEO, Lonne Jaffe! We are energized by all that’s happening and committed to strengthening Hadoop as an enterprise data platform!