Shortly after Strata in New York last year, Syncsort Big Data Product Manager Paige Roberts caught up with Sean Anderson (@SeanAndersonBD), who is in charge of product marketing for data science and engineering at Cloudera. In this first part of the interview, he discusses the new features and improvements in Spark 2.0, Spark in the cloud, and Structured Streaming.
Paige Roberts: So what have you been up to?
Sean Anderson: Well, we had Strata, of course, and we recently participated in an Apache Spark Market Survey, so that was fun, with a company called Taneja Group. Now, I’m dissecting some of those survey results to try to understand the market a little bit better.
Roberts: Yeah. We just did a Hadoop user survey ourselves. So, tell me what interesting things happened at Strata?
Anderson: The big news there was the announcement for support of Apache Spark version 2.0. We were happy to push that out the door. That comes with some pretty cool new features. Obviously when you go up a point release that’s where they start to make the big moves. We are happy to report that all our testing and integration and further development has gotten to the point where we have a pretty healthy amount of users implementing it today.
Roberts: Yeah, I talked to Holden Karau at the last Hadoop Summit about Spark 2.0. She was pretty excited about it too. So, what can you tell us about it? What kind of cool new stuff does Spark 2.0 bring to the table?
Anderson: Across the board, we are seeing performance improvements. Whether it’s machine learning or Spark SQL or Spark Streaming, there are performance increases with Spark 2.0.
There’s a new feature called Machine Learning Persistence, where basically either a machine learning model or a pipeline to feed a model can be saved offline and loaded later. This is really important because as we move models from development to production, it’s now easier to simply deploy the saved files and ensure all the pipelines and details of the model remain preserved. So when you think about machine learning in general previous to Spark 2.0, it would be hard because you’d write a machine learning algorithm. You could implement it and then, say you wanted to kill that cluster because you were no longer using it. Without persistence, you would have to recode and relaunch all of those efforts.
Now you have the ability to save and load those models. And when it comes to pipelines, those are also extremely cumbersome to code and implement again. So the ability to just retrieve those pipelines on a whim, and create more repeatable workloads, that’s a big chunk of the new functionality in Spark 2.0.
Roberts: That’s huge for cloud especially.
Anderson: Yeah, and just in general, for the repeatability of these machine learning workloads. We’ve seen a lot of one-off type machine learning activities. Making that a more repeatable process is, I think, going to be pretty novel moving forward.
Roberts: Excellent. Holden talked about the difficulty of porting models. Like if you train your Spark model, and then you try to port it over to a production application, that was a challenge. Does this help with that as well?
Anderson: Yes, it gives you the ability to test that model out on a smaller subset of data, and then put it in production on much larger data sets. That’s generally how people would implement that feature.
Roberts: As well as, like you said, transient cloud implementations. Awesome. That’s a very cool new capability.
Anderson: On top of that, there are some new algorithms supported in machine learning that I know some of the users are particularly excited about.
Roberts: New algorithms are always good.
In Part 2 tomorrow, Sean will talk more about the latest improvements in Spark that support streaming and batch jobs.
Note: When this interview was conducted, Spark 2.0 had just gone into beta. Today, Cloudera is in GA with Spark version 2.1.