Take-aways from Hadoop Summit 2016 in San Jose: The Evolution of the Big Data Conversation

The Hadoop Summit was a great way to celebrate 10 years of Hadoop. There were over four thousand attendees, over 170 sessions, and lots of new sponsors. The event generated a lot of buzz; in fact, Hortonworks proudly announced on Wednesday that the conference had been the number 1 trending topic on Twitter the previous day.

Hortonworks also announced that it will change the name of the conference from Hadoop Summit to Dataworks Summit and add more tracks, such as streaming, cloud, and DevOps.

During Tuesday’s Keynotes, the community committers in the audience were invited to come up to the stage for a round of applause.

Herb Cunitz and Rob Bearden (President and CEO of Hortonworks, respectively) pointed out that Big Data conversations are expanding from technology to business value. The Keynotes were a good mix of technical announcements and compelling customer use cases.

The use cases illustrated Hortonworks’ point that Big Data is transforming the way we do business. Progressive discussed their usage-based insurance pricing strategy. Capital One talked about fraud detection. GE and Comcast talked about predictive maintenance. Macy’s talked about having a 360-degree view of the customer.

There were also loftier goals to change the world. Microsoft talked about projects to improve children’s education in India by predicting school drop-outs. They also talked about fighting world hunger by predicting the best time to sow crops, and about crowd-sourcing the measurement of radiation levels. Arizona State University talked about improving breast cancer diagnosis. Hortonworks also announced the Open Source Genomics project, which aims to make it cheaper to analyze human genomes: $60K today compared to $100M in 2001, according to Joseph Sirosh of Microsoft. This project ties into the White House Precision Medicine initiative.

On the technical side, Hortonworks announced a technical preview of the Hortonworks Cloud Platform. They also announced the concept of Assemblies, which will allow customers and vendors to package end-user applications, such as fraud detection, using Docker and deploy them through Ambari. This should make Hadoop a lot easier to use by abstracting away the setup of the cluster and the services that an application needs in order to run on Hadoop.

Hortonworks also talked about the notion of a Connected Data Platform: all data, in motion and at rest, across the Cloud and the Data Center. Hortonworks CTO Scott Gnau called it “all the data, all the time” when he made a joint appearance on theCube with Syncsort’s General Manager for Big Data, Tendü Yoğurtçu, PhD. They talked about why Hortonworks decided to resell Syncsort’s DMX-h, and how it is helping customers onboard data and ETL applications into Hadoop.


Scott and Tendü also had a session on Tuesday afternoon to talk about the challenges of ingesting legacy Data Warehouse and Mainframe data into Hadoop and integrating it with data in motion, whether on premises or in the cloud. They talked about how Syncsort has been able to leverage its expertise in database and mainframe connectivity with DMX-h. They also described how Syncsort continues to innovate, working with the open source community, by adding support for streaming data and by integrating its light-weight engine to run natively on MapReduce and Spark.

Tendü presented use cases of joint Hortonworks and Syncsort customers in Insurance, Media and Hospitality. These customers are able to create new business initiatives, extract intelligence from their data faster, save development costs by leveraging existing ETL skills, reduce the load on legacy Data Warehouse platforms, and try their applications on Spark without any redevelopment costs.

This last statement drove a lot of excitement and moved one member of the audience to ask Tendü to elaborate on how it’s possible to run the same ETL flow on MapReduce and Spark without having to make any changes. Tendü explained that since our engine integrates natively with both computing frameworks and doesn’t generate code, Syncsort users enjoy the ability to develop ETL flows that are ‘future-proof’, as they can later be run on different distributed computing frameworks as the technology evolves.
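To make the idea concrete, here is a minimal sketch of a flow that is defined once and then handed to interchangeable runners. It is only an illustration of the general design, not how DMX-h is implemented: every name in it (`build_flow`, `run_sequential`, `run_partitioned`, `RUNNERS`) is hypothetical, and the "partitioned" runner is a plain-Python stand-in for a distributed framework.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]
Step = Callable[[Record], Record]


def build_flow() -> List[Step]:
    """Define the ETL flow once, as ordered record-level steps.

    The flow names no execution framework (hypothetical example steps).
    """
    def normalize(rec: Record) -> Record:
        return {**rec, "name": str(rec["name"]).strip().upper()}

    def add_total(rec: Record) -> Record:
        return {**rec, "total": rec["qty"] * rec["price"]}

    return [normalize, add_total]


def run_sequential(flow: List[Step], records: Iterable[Record]) -> List[Record]:
    """Apply every step to every record, one record at a time."""
    out = []
    for rec in records:
        for step in flow:
            rec = step(rec)
        out.append(rec)
    return out


def run_partitioned(flow: List[Step], records: Iterable[Record],
                    partitions: int = 2) -> List[Record]:
    """Stand-in for a distributed engine: process contiguous chunks separately."""
    records = list(records)
    size = max(1, -(-len(records) // partitions))  # ceiling division
    out = []
    for start in range(0, len(records), size):
        out.extend(run_sequential(flow, records[start:start + size]))
    return out


# The engine is chosen at run time; the flow definition never changes.
RUNNERS = {"sequential": run_sequential, "partitioned": run_partitioned}


def run(flow: List[Step], records: Iterable[Record], engine: str) -> List[Record]:
    return RUNNERS[engine](flow, records)
```

In this sketch, supporting a new engine means adding one entry to `RUNNERS`; the flow returned by `build_flow` stays untouched.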

Netflix had a session describing their experiments with taking some data flows developed in Pig and rewriting them in Scala to run on Spark. They saw performance gains of 2.5x to 3x. Syncsort customers can try their flows in Spark by simply choosing which run-time computing framework to use. This makes development a lot less risky, and modernization efforts as simple as the click of a button.

There were also some discussions on streaming data sources. Confluent had a couple of great sessions on Kafka. The Confluent team was on hand at the Syncsort booth to show off the new features of Kafka and to do a joint demo of how Syncsort’s Confluent-certified Kafka consumer and producer are being used to enrich streaming financial transactions with at-rest data that came from a database on the mainframe.
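The enrichment pattern from that demo can be sketched in a few lines. This is an assumption-laden illustration rather than the demo itself: the transaction stream and the reference table are in-memory stand-ins (in practice they would come from a Kafka consumer and a mainframe database), and the field names `account_id`, `customer`, and `segment` are invented for the example.

```python
from typing import Dict, Iterable, Iterator

Record = Dict[str, object]


def enrich(transactions: Iterable[Record],
           accounts: Dict[str, Record]) -> Iterator[Record]:
    """Join each streaming transaction with at-rest account data.

    `transactions` is any iterable of records (a stand-in for a stream);
    `accounts` is a lookup table keyed by the hypothetical `account_id` field.
    Transactions with no matching account are passed through with empty
    enrichment fields instead of being dropped.
    """
    for txn in transactions:
        account = accounts.get(txn["account_id"])
        if account is None:
            yield {**txn, "customer": None, "segment": None}
        else:
            yield {**txn,
                   "customer": account["customer"],
                   "segment": account["segment"]}


# Example inputs (made-up data, for illustration only).
accounts = {
    "A1": {"customer": "Ann", "segment": "retail"},
    "B2": {"customer": "Bob", "segment": "business"},
}
stream = [
    {"account_id": "A1", "amount": 25.0},
    {"account_id": "Z9", "amount": 10.0},
]
enriched = list(enrich(stream, accounts))
```

Because `enrich` is a generator, it processes one transaction at a time and never needs to hold the whole stream in memory; only the at-rest lookup table is kept resident.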

Hortonworks was also at the Syncsort booth to do a joint demo of how Syncsort helps onboard data and ETL processes into HDP, and show off the integration of DMX-h with Ambari for easy deployment of DMX-h, and easy monitoring of the jobs.

A lot of the conversations at the booth were inquiries about how to move and access Mainframe data in Hadoop. This is not surprising. As Robin Purohit of BMC mentioned during the keynotes, 70 to 80% of all online transaction processing data is still on the Mainframe.

The summit also had some interesting sessions on streaming computing platforms. DataArtisans had a couple of sessions on Flink and cited Bouygues Telecom in France as a Flink user.

Yahoo and Capital One presented a benchmark done at Yahoo a few months earlier to compare Flink, Storm, and Spark Streaming on latency and throughput. The conclusion: Flink and Storm have lower latency, but Spark Streaming can handle higher throughput.

Hadoop Summit was a great place to hear about how organizations are leveraging Big Data to accelerate time to value, and learn about the new efforts being developed to make Hadoop easier to adopt.

For us at Syncsort, this was a very exciting conference. The support of our partners, from joint interviews, sessions, and demos with Hortonworks to joint demos with Confluent, helped validate our strategy and the value we bring to the community.

See the results of Syncsort’s third annual Hadoop Survey in the free eBook: Hadoop Perspectives for 2017.

Authored by Fernanda Tavares

Vice President, Data Integration R&D
1 comment
  1. David Normandeau July 5, 2016 at 6:32 pm

    Thanks Fernanda for the excellent summary of Hadoop Summit. It was an exciting four days with a lot of interesting discussions and presentations. Looking forward to next year’s Dataworks Summit.
