Expert Interview (Part 2): Sean Anderson Talks about Spark Structured Streaming and Cloud Support
In Part 1, Cloudera’s Sean Anderson (@SeanAndersonBD) summarized what’s new in Spark 2.0. In Part 2, he talks more about new features for Spark Structured Streaming, including how unified APIs simplify support for streaming and batch workloads, and about support for Spark in the Cloud.
Anderson: In Spark 2.0, the ecosystem combined the functional APIs, and now you have a unified API for both batch and streaming jobs. It’s pretty nice to not have to use different interfaces to achieve this. There’s still native language support, and the APIs are still very simple and easy to use, but now for both of those types of workloads.
Roberts: Ooh! Streaming and batch together in one interface is something Syncsort has been pushing for a while! That’s great to hear. Very validating.
Anderson: Then the last improvement was around Spark Structured Streaming, which is a streaming API that runs on top of Spark SQL. That generally gives us better performance on micro-batch or streaming workloads, and really helps with things like out-of-order data handling.
There was this issue with Spark Streaming before where you may have outputs that resolve themselves quicker than the actual inputs or variables. So you have a lot of really messy out-of-order data that people had to come up with homegrown solutions to address.
And now that Spark Structured Streaming essentially treats a stream as a table that keeps having rows appended to it forever, you can handle that out-of-order data a lot better.
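To make the "table that grows forever" idea concrete, here is a minimal plain-Python sketch. It is a conceptual model only, not Spark's API: the names `Event`, `WindowedCounts`, `batch_counts`, `streaming_counts`, and the 60-second window size are all hypothetical. It shows that when rows are grouped by the time recorded in the event itself rather than by arrival order, one query definition gives the same answer whether the data is processed as a single batch or as micro-batches containing a late record.

```python
from collections import Counter
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Event(NamedTuple):
    event_time: int  # seconds, stamped at the source (not arrival time)
    key: str


WINDOW_SECONDS = 60  # hypothetical window size for this sketch


def window_start(event_time: int) -> int:
    """Return the start of the fixed window that contains event_time."""
    return event_time - event_time % WINDOW_SECONDS


class WindowedCounts:
    """Running count per (window_start, key): the query's result table."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def append(self, rows: Iterable[Event]) -> None:
        # Each row is bucketed by its own event time, so a row that
        # arrives late still updates the window it belongs to.
        for row in rows:
            self.counts[(window_start(row.event_time), row.key)] += 1


def batch_counts(rows: Iterable[Event]) -> Dict[Tuple[int, str], int]:
    """Run the query once over a complete, bounded input."""
    table = WindowedCounts()
    table.append(rows)
    return dict(table.counts)


def streaming_counts(
    micro_batches: Iterable[List[Event]],
) -> Dict[Tuple[int, str], int]:
    """Run the same query incrementally, one micro-batch at a time."""
    table = WindowedCounts()
    for batch in micro_batches:
        table.append(batch)
    return dict(table.counts)


# The last event (time 10) arrives after events from the next window.
events = [Event(5, "a"), Event(65, "a"), Event(70, "b"), Event(10, "a")]
micro_batches = [events[:2], events[2:]]

print(batch_counts(events))
print(streaming_counts(micro_batches))
```

Both calls print the same table, with the late event counted in the first window. This sketch keeps every window in memory indefinitely; a real streaming engine also has to decide how long to wait for late data before discarding old state, which is not modeled here.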
Roberts: Streaming and batch seem like they’ve always been two separate things, and they’re becoming more and more just two different ways to handle data. We are also seeing a lot of push towards Cloud. What else are you seeing coming up that looks exciting?
Anderson: For us, really understanding how we guide our customers on deploying in the Cloud is great. There are persistent clusters, and there are transient clusters. For ETL, what’s the best design pattern? For exploratory data science, what’s the best? For machine learning, what’s the best for cloud-based scoring? So giving customers some guidance on those aspects is key.
Recently, we announced S3 integration for Apache Spark, which allows us to run Spark jobs on data that already lives in S3. The transient nature of clusters makes it very easy to just spin up compute resources and run a Spark job on data that lives in S3. And then you don’t have to spend all that time moving the data and going through all the manual work on the front end.
Roberts: Really work on the data right where it is.
Anderson: Exactly. That’s Spark in the Cloud.
Syncsort recently announced support for Spark 2.0 in our DMX-h Intelligent Execution (IX) capabilities. Be sure to check that out, and see what the experts have to say about Spark in our recent eBook.
Also, be sure to read the third and final part of this interview: Paige and Sean talk about two new projects that Cloudera is excited about, Apache Livy and Apache Spot.