As we wrapped up Strata Conference + Hadoop World 2013, I remembered the first Hadoop World, a one-day conference with three session tracks and about 500 registered participants. Blog posts after the event claimed ‘Hadoop usage is exploding’.
In its fifth year, Hadoop World, now a joint event by Cloudera and O’Reilly, drew a much larger audience, with eight session tracks and over 3000 attendees. The content has evolved from technology-focused to business- and application-focused, with topics ranging from facial recognition to healthcare, from agriculture to waste management, and from romance to defense. At Data Driven Business Day, after a series of sessions with no mention of terabytes or petabytes of data, Alistair Croll pointed out that the discussions were all about data-driven decision making and the business problems we are trying to solve. This is a strong indication that the technology is maturing and the focus is shifting to use cases.
Overall, it was great to see the community delivering on the vision of making Hadoop 2 the enterprise data platform. The ecosystem is evolving as more vendors bring domain expertise to the Hadoop platform (for example, Appfluent visibility for Hadoop and Syncsort’s mainframe-to-Hadoop offload solution) and as more use cases emerge in which the platform is used to operationalize and optimize outcomes at scale.
With the general availability of Apache Hadoop 2 a couple of weeks before the conference, followed by the Hortonworks HDP 2.0 announcement, Hadoop morphed into an operating system for big data. HDP 2.0 had strong ecosystem support, with several certified partners, including Syncsort, contributing to the modern data architecture. “Apache Hadoop v2 is not just a major release number, but represents a generational shift in the architecture of Apache Hadoop,” explained Arun Murthy in his blog on the announcement. The YARN architecture provides a resource management layer abstracted from applications, which allows multiple kinds of data flows, such as real-time processing and batch processing, to run on the same cluster while sharing data and metadata. The main impact is the decoupling of Hadoop from MapReduce: MapReduce becomes just another application running on Hadoop. Quite a few sessions covered applications moving into the Hadoop 2 stack and running on top of YARN. Storm-YARN was the highlight of Hadoop Summit 2013; Spark, an in-memory processing engine for machine learning applications and interactive data mining, became a highlight at Strata + Hadoop World through Cloudera’s announcement of direct support for Apache Spark with CDH. These are all critical steps towards running all types of workloads on Hadoop, which is inevitable, as Doug Cutting predicted.
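For readers who like to see the idea in code, here is a minimal toy sketch of what "a resource management layer abstracted from applications" means. This is not the YARN API; every name in it (`ToyResourceManager`, `Container`, `run_demo`, the job names and memory sizes) is hypothetical and chosen only for illustration. The point is simply that the manager grants capacity without knowing or caring whether the requester is a batch job, a streaming job, or something else.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    """A slice of cluster memory granted to one application (hypothetical)."""
    app_name: str
    memory_mb: int


@dataclass
class ToyResourceManager:
    """Toy stand-in for a cluster resource manager; not the real YARN API.

    It tracks memory only and never inspects what kind of work an
    application does.
    """
    total_memory_mb: int
    containers: list = field(default_factory=list)

    def used_memory_mb(self) -> int:
        return sum(c.memory_mb for c in self.containers)

    def allocate(self, app_name: str, memory_mb: int):
        """Grant a container if capacity allows; otherwise return None."""
        if self.used_memory_mb() + memory_mb > self.total_memory_mb:
            return None
        container = Container(app_name, memory_mb)
        self.containers.append(container)
        return container

    def release(self, container: Container) -> None:
        """Return a container's memory to the shared pool."""
        self.containers.remove(container)


def run_demo():
    # Assumed cluster size and request sizes, purely illustrative.
    rm = ToyResourceManager(total_memory_mb=8192)
    batch = rm.allocate("mapreduce-batch-job", 4096)
    stream = rm.allocate("stream-processing-job", 2048)
    # Only 2048 MB remain, so this request is refused for now.
    rejected = rm.allocate("in-memory-analytics-job", 4096)
    # Once the batch job finishes, the same request fits.
    rm.release(batch)
    retried = rm.allocate("in-memory-analytics-job", 4096)
    return rm, stream, rejected, retried


if __name__ == "__main__":
    rm, stream, rejected, retried = run_demo()
    print("first analytics request granted:", rejected is not None)
    print("retry after batch release granted:", retried is not None)
    print("memory in use (MB):", rm.used_memory_mb())
```

In this sketch the batch job is just one tenant among several, which mirrors the observation above that MapReduce becomes one application among many on the same cluster.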
Mike Olson announced the beta of Cloudera Enterprise 5 during his keynote and introduced the Enterprise Data Hub. The Enterprise Data Hub has Hadoop (CDH 5) at its core, with Cloudera Manager for enterprise-class deployment, management and monitoring, and Cloudera Navigator for governance, data discovery and lineage. It is backed by a broad ecosystem of supporting tools and technologies, including Syncsort.
Another major theme of the conference was cloud deployment. Shortly before Strata + Hadoop World, Mirantis announced its version of the OpenStack cloud distribution, including Windows integration and Hadoop as a Service. During the conference, Rackspace announced Big Data Platform offerings, powered by HDP, on dedicated servers and external storage such as EMC Isilon, and on public and private clouds. Cloudera launched an expanded partner program, Cloudera Connect: Cloud, to support Hadoop deployments in public cloud environments. Cloud solution providers Savvis, IBM SoftLayer, T-Systems and Verizon Enterprise Solutions have already joined the program. Olson also shared plans to support private cloud deployments through OpenStack and VMware.
For me personally, one of the most remarkable moments of the conference was when Jim Kaskade of Infochimps questioned our purpose and confronted all of us in the room. He called on the entire community to focus our strategy and execution on solving big problems that really matter, like cancer. Only once we channel the efforts of the ecosystem towards solving the world’s problems, I think, can we call Big Data “the best thing since sliced bread”, an analogy John Choi of IBM used.
At Hadoop Summit 2012, Geoffrey Moore discussed being in the early market, where the ecosystem pulls together the technical part of the ‘whole’ product and makes sure things can actually work at scale. With Hadoop transforming into an enterprise data platform deployed on private and public clouds, and becoming a data hub that can run anywhere, we are very close to having a technical ‘whole’ product and crossing the chasm.