2015 Strata + Hadoop World in NYC: Apache Spark Igniting Big Data Adoption
Looking back on my notes from February’s Strata + Hadoop World, it’s great to see the consistency of the themes and progress made towards the goals of platform maturity, real-time data analysis, and the adoption of Apache Spark.
At last week’s Strata + Hadoop World, organized by O’Reilly and Cloudera in New York, there were more examples of large enterprises using Hadoop in production. As Paul Kent from SAS put it, it’s time to “celebrate Hadoop as a full-fledged ecosystem.”
The Syncsort team at Strata + Hadoop World 2015
A lot of the use cases highlighted by Cloudera and MapR during their keynotes take advantage of real-time streaming information. This is consistent with what we’re hearing from Syncsort customers as well. Consumption of streaming data is the next big deal. We saw a lot of traffic in our booth driven by the demo of our Apache Kafka support, which is currently in Technical Preview.
Progress has also been made on Spark, with Mike Olson from Cloudera claiming significant adoption by enterprises, and the announcement of the ‘One Platform’ initiative. A couple of weeks before Strata, Syncsort announced the contribution of an open source mainframe connector for Spark, available on Spark Packages. Earlier this summer, IBM announced their intention to support Spark machine learning initiatives – further evidence of the traction Spark is getting in the Big Data ecosystem.
During Strata, at BigDataNYC 2015, our General Manager of Big Data, Tendü Yoğurtçu, was interviewed on SiliconANGLE’s live-streamed Internet show theCUBE, where she discussed how increasing demand for streaming use cases is driving interest in Spark, and predicted that Spark and MapReduce will likely co-exist in the near term.
Tendü Yoğurtçu, general manager of Big Data, discusses forging a pipeline for data integration at BigDataNYC 2015 with SiliconANGLE.
Leveraging the maturity of Hadoop and drawing on the experience of others was of interest to a lot of the conference attendees. I attended both FINRA’s and Cloudera’s sessions on migrating legacy data warehouse workloads to Hadoop. Legacy data warehouse offload is one of the main use cases we see in our customer base. It is a very good first project to show value by freeing capacity and budget from the enterprise data warehouse. Both sessions talked about the value of being able to leverage SQL on Hadoop during the early adoption phase. Both presenters also described challenges related to differences in SQL syntax and lagging functionality.
Alan Choi from Cloudera recommended that the best way to start the migration of data warehouse workloads to Hadoop is to identify candidate workflows and pick a simple one. In our experience, this is easier said than done. A lot of our customers had complex legacy SQL code that was hard to understand, and even harder to port to something that would run on Hadoop. In order to address that issue, we developed an internal tool to help visualize, understand, and ultimately migrate the SQL workload to Hadoop.
Jaipaul Agonus from FINRA is starting to look at Spark and mentioned that it is important to abstract the execution framework from the business logic. We completely agree with him. By designing DMX-h with the concept of Intelligent Execution, our goal is to allow customers to build their data processing pipelines in a natural, point-and-click fashion, without worrying about whether to run in a distributed framework or stand-alone, on premises or in the cloud, as batch or streaming.
As Hadoop is more widely adopted in production, data governance is coming to the forefront. During a Birds of a Feather lunch discussion, it was clear that a strategy is needed, but there are no clear winners yet. Joe Hellerstein of Trifacta and UC Berkeley had a session on the importance of an agreed-upon metadata services medium. Conversations with our partners show that there is support for such a vendor-agnostic strategy. At Syncsort, we understand the importance of governance and are working on integrating security from day one and engaging with partners like Cloudera, Hortonworks and MapR to support an open data governance solution that will best serve our enterprise customers.
With 6,300 registered attendees, this Strata + Hadoop World NYC conference was bigger than ever before. The Expo hall had so many new vendors that a directory and aisle markings became necessary to navigate the floor. We are excited to be part of this vibrant community, and proud of our contributions to the success of Hadoop and related big data projects. We plan to continue our investment in Spark and help make it easier to deploy streaming data pipelines.