2015 was an exciting year for Apache Spark. A couple of huge companies (Netflix and Uber) took on major projects using Spark as the backbone, and a few major players jumped into the ring, including Apache Flink and Apache Beam. 2016 has been equally eventful, with the announcement of general availability for Spark 2.0. Perhaps the most notable new feature in this version is Structured Streaming. What is it? Do you need it?
Structured Streaming is really more of a collection of new features added to Spark Streaming than a radical alteration to the Spark you already know and use in your data warehousing efforts. Essentially, micro-batching, the foundational principle of Spark’s streaming architecture, is still alive and well.
Spark 2.0 Features Infinite DataFrames
DataFrames are an alternative to Spark’s native RDD primitive. DataFrames benefit from the Catalyst query optimizer, and when used in conjunction with Datasets (introduced in version 1.6) they can leverage dedicated encoders that deliver much faster serialization and deserialization. In Structured Streaming, a stream is modeled as an unbounded DataFrame to which newly arrived records are continually appended, which is where the “infinite” comes in.
Abstraction of Repeated Queries
Basically, the repeated query (RQ in data warehousing speak) abstraction treats most streaming applications as asking the same question over and over, such as, “How many new visitors hit our website within the past eight minutes?” Users run a query against the DataFrame, just as in previous versions of Spark. But users can also designate a “trigger” to specify how often the query should run. Then they specify the “output mode” for the query, plus a data sink to accept the output, and voilà. It’s just that easy.
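To make the moving parts concrete, here is a minimal sketch of the repeated-query idea in plain Python. It is a toy model, not Spark code: the `RepeatedQuery` class, its `on_trigger` method, and the page names are all hypothetical, and the two output modes are only loosely modeled on Spark's. The query asked is "how many visits has each page received?", re-evaluated every time the trigger fires.

```python
from collections import Counter
from typing import Callable, Dict, Iterable


class RepeatedQuery:
    """Toy model of a repeated query over an ever-growing table (not Spark API)."""

    def __init__(self, sink: Callable[[Dict[str, int]], None],
                 output_mode: str = "complete") -> None:
        if output_mode not in ("complete", "update"):
            raise ValueError("output_mode must be 'complete' or 'update'")
        self.sink = sink                # where each result is delivered
        self.output_mode = output_mode  # what part of the result is delivered
        self.counts: Counter = Counter()  # running answer to the query

    def on_trigger(self, new_rows: Iterable[str]) -> None:
        """Run once per trigger firing with the rows that arrived since the last one."""
        batch = Counter(new_rows)
        self.counts.update(batch)  # fold the new rows into the running result
        if self.output_mode == "complete":
            result = dict(self.counts)  # the whole result table
        else:
            # "update": only the rows of the result that this batch changed
            result = {page: self.counts[page] for page in batch}
        self.sink(result)


# Usage: a list stands in for the data sink; each call simulates one trigger firing.
emitted = []
query = RepeatedQuery(sink=emitted.append, output_mode="update")
query.on_trigger(["/home", "/pricing", "/home"])
query.on_trigger(["/pricing"])
print(emitted)
```

In a real deployment the trigger would be a timer managed by the engine and the sink would be a file, a database, or a message queue. The sketch only shows how the query, the trigger, the output mode, and the sink relate to one another.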
Support for Ad-Hoc Queries
One of the other notable features of Spark 2.0 is support for ad-hoc queries. If you wanted to know the number of visitors who hit your website in 20-minute increments over the course of several days, you used to have to set up a new process to communicate with the Kafka stream and then build up the query. With the new Structured Streaming, you can simply connect to the Spark application you’re already running and toss in an ad-hoc query. This gets you much closer to true real-time data collection. If you’ve got a full-time, around-the-clock data warehousing data flow going, the same feature that delivers ad-hoc queries also allows the Spark application to update its runtime operations dynamically, which is much better for production environments.
Spark 2.0 is still in its early stages, so there are likely to be some tweaks and changes forthcoming. At any rate, these improvements should serve to keep Spark relevant in the meantime.
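The 20-minute example above boils down to a windowed count. As a rough illustration of the arithmetic only (plain Python rather than Spark, with hypothetical function names and made-up timestamps), each visit is rounded down to the start of its 20-minute window and the visits in each window are tallied:

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Dict, Iterable

WINDOW = timedelta(minutes=20)
EPOCH = datetime(1970, 1, 1)


def window_start(ts: datetime, width: timedelta = WINDOW) -> datetime:
    """Round a timestamp down to the start of its fixed-width window."""
    # Windows are aligned to the epoch, so 20-minute windows start at :00, :20, :40.
    return ts - ((ts - EPOCH) % width)


def visitors_per_window(visits: Iterable[datetime],
                        width: timedelta = WINDOW) -> Dict[datetime, int]:
    """Count the visits that fall into each window."""
    return dict(Counter(window_start(ts, width) for ts in visits))


# Usage with made-up visit times on a single morning.
visits = [
    datetime(2016, 8, 1, 10, 5),
    datetime(2016, 8, 1, 10, 19),
    datetime(2016, 8, 1, 10, 20),
    datetime(2016, 8, 1, 10, 59),
]
for start, count in sorted(visitors_per_window(visits).items()):
    print(start.strftime("%H:%M"), count)
```

A streaming engine performs the same grouping incrementally as records arrive instead of over a finished list, but the window assignment is the same idea.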
If Spark is a vital part of your Hadoop data warehousing ecosystem, Syncsort’s latest release of DMX-h combines Spark Streaming with Kafka and Hadoop to help you achieve true real-time analytics for any application. Visit Syncsort now to see this and many other practical, cutting-edge, Big Data solutions.