In Part 1 of this interview, IBM’s Holden Karau and I began a discussion on Hadoop, ETL, machine learning and Spark. Here are more insights from my chat with Holden at Strata + Hadoop on use cases for ETL in Spark for machine learning, where Scala and Python are used, and her predictions about the future of Spark.
Well, I’m always interested in use cases, how people are putting the tech to work. Spark is your area. What do you see as the most common use cases for Spark? What do people really do with it?
I think frankly, at the end of the day, there are a lot of traditional ETL jobs in Spark. People get fed up with writing MapReduce jobs for ETL, so they come to Spark. And I think certainly there’s also a fair number of people using Spark for machine learning, but it has some really challenging problems that make it difficult to use.
For machine learning?
Yeah, it’s great at training models, but model export and import, and serving-time concerns, are also incredibly important. We don’t just train models for the sake of training models. Well, sometimes you train a model to understand your data better, and that’s okay, that’s cool. But normally, we’re training a model to actually predict stuff.
And you have to be able to export the model for that. Otherwise…
It’s not very useful. Sure, there is limited import/export support. But several companies that I’ve talked to have had to write their own import/export layer for the models they care about, because the default one in Spark just isn’t up to snuff. That definitely means they’re using it and they see the value in it, but there’s still a pretty substantial undertaking to really be able to use Spark.
For machine learning.
For machine learning. For online predictions, in particular. If you’re only going to be doing batch predictions, it’s great. It works fine. You can just keep your model in Spark. You can save it. You can load it back into Spark. But you probably don’t actually want to have all the Spark libraries on your edge nodes for serving.
So, if you’re trying to predict stuff as it happens in high speed…
Yeah, you can’t just use Spark. You need independent training, and you need independent prediction APIs.
And in order to do that, you’ve got to export your model.
Yeah. With time, I think we’ll start to see more of the export-import functionality along with better serving functionality to support those. I think Spark machine learning is actively used, but there are some hurdles preventing it from being as actively used as it could be.
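The workaround she alludes to can be quite small in practice: some teams export just the learned parameters to a neutral format like JSON and reimplement the scoring math in a dependency-free serving layer. Here is a minimal sketch of the idea in plain Python — the coefficients are invented stand-ins for what you would pull out of a trained Spark model, not Spark APIs:

```python
import json
import math

# Hypothetical parameters from a logistic regression trained in Spark;
# in practice you would read them off the fitted model object.
model = {"weights": [0.25, -1.3, 0.7], "intercept": 0.1}

# Export step: write only the numbers the serving layer needs.
exported = json.dumps(model)

# Serving side: no Spark dependency, just the scoring math.
def predict(features, blob):
    m = json.loads(blob)
    z = m["intercept"] + sum(w * x for w, x in zip(m["weights"], features))
    return 1.0 / (1.0 + math.exp(-z))  # logistic sigmoid

score = predict([1.0, 0.5, 2.0], exported)
```

The trade-off is exactly the one described above: the export format is trivial, but you now maintain the scoring logic yourself and must keep it in sync with whatever Spark learned.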
I always thought that was Spark’s sweet spot.
You know it’s really awesome at it.
It never occurred to me that exporting models would be a road block.
It’s challenging compared to other tools. To be fair, part of the reason is that there isn’t really a super great industry standard. There’s PMML, but PMML is really frustrating; it’s not a great format to work in either.
But it’s about the only game in town. PMML is the only one where anybody even thought, “I want to be able to move models from here to there.”
Yeah, it’s the only one people thought about, but it’s frequently not nearly expressive enough, which is why you see a lot of people having to write custom code for their models and for their serving.
Well, do you think the solution is to improve PMML? Or come up with a whole different way to do it?
I think improved PMML export in Spark could go a long way, and that’s something I care about, but I don’t think it’s necessarily a short term solution.
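For context, PMML is an XML vocabulary: a model is described declaratively as data fields plus parameters, so any PMML-aware scorer can evaluate it. A hand-written, illustrative fragment for a one-feature linear regression (field names and coefficients are invented for the example) looks roughly like this:

```xml
<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header description="Illustrative linear regression"/>
  <DataDictionary numberOfFields="2">
    <DataField name="x1" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="x1"/>
      <MiningField name="y" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="0.5">
      <NumericPredictor name="x1" coefficient="1.25"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
```

The frustration Holden describes shows up as soon as a pipeline includes a transformation or model type the vocabulary doesn’t cover well; at that point you’re back to writing custom export and serving code.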
That’s going to take some time to accomplish.
Yeah, and so there’s some work to make some of the models in Spark not depend on the rest of Spark, so you can use them on the serving side, if you’re fine with doing JVM for your serving. I think we’ll see that come maybe in the 2.0 time frame, maybe later in 2.5. I think Spark will figure out a good solution. I don’t know which one it’s going to be yet. It’s also possible that one of these other companies that have made their own import/export logic will open source it, and it will be good enough that people will find it interesting.
They’ll go with it.
Well, before I turned the recorder on, we were talking about interfaces. The first thing you said when I asked about use cases was that Spark is being used mainly for ETL. So is everybody who’s doing ETL with Spark, are they all writing Scala?
I think a lot of people are doing their ETL in Scala, yes. Traditionally, you see Python used more for exploratory and analytics work. For ETL, there’s a performance overhead to using Python, and it doesn’t give you all that much that you don’t get in other languages. So most people do their ETL work in a JVM language like Scala or Java…
Yeah, I saw your presentation at Data Day Texas, and somebody else — I can’t think of who it was, maybe Sarah Guido — did a presentation on PySpark. She was griping that PySpark just isn’t at parity with Scala. She had a problem, posted it on the forum, and the comments she got on how to fix it were all essentially, “Write it in Scala.” There are other interfaces and development environments appearing, other ways to write Spark jobs now. I hadn’t heard of the Jupyter one until recently, but the Databricks notebooks I knew about, and Zeppelin. Now graphical user interfaces are starting to appear: KNIME has early Spark support, and Syncsort has Spark support in its new version. So you’ve got at least the beginning of that. Is that the direction it will go over time, do you think?
I think you see a lot of tools building on top of Spark. In the long term, I think that’s actually right. If Spark is successful, some developers will still work in it directly, of course, but abstractions will be built on top of Spark for the specific verticals and tools that need them. Not all of those things belong inside of Spark itself. Spark can be a great execution engine for a large number of problems without having to add specific support for each of those verticals. As far as language integrations go, I think we’ll see improvements, certainly with Datasets coming along. They bring a number of potential improvements, and I think we’ll continue to work on that, but it’s a hard problem.
I think Tungsten has a lot of potential. If our data at rest, in its representation form, isn’t in the JVM, if it’s just in regular memory, there’s the potential to make it accessible without having to re-serialize things quite as often, but that becomes very complicated. There are a lot of things that Tungsten will make possible, but maybe not easy.
Tungsten is very cool tech. I blogged about that. I think the direction they are going is pretty interesting. I’m glad to see they’re finally doing some really smart stuff. In that vein, what cool thing are you working on right now?
I am obviously excited about the next book I’m working on. It’s in early release now. I’m biased, of course, but “High Performance Spark,” which I’m writing with Rachel Warren, is a new O’Reilly book that’s very cool. We’ve got the first four chapters out. We’ll probably not hit our scheduled deadline, but my plan is that we will finish the book this year. Spark 2.0 will be out by then, so we can update everything and have the new APIs all correctly covered.
So it will be current.
Yeah, so our actual release date will have to lag Spark’s. Admittedly, blaming Spark for a slipping release deadline on the book is a little hokey, but you know we can pretend that’s why the deadline slipped.
[laughing] Well, Spark is a moving target.
It was great talking to Holden and getting caught up on the current and future state of Spark. It was pretty much what I expected from her: lots of good advice for Spark implementations, honest answers on the strengths and weaknesses of Spark, and an eagle-eyed view of where Spark is going. Be sure to watch for her new book, “High Performance Spark,” which will undoubtedly come out shortly after Spark 2.0.
For more on Hadoop, review these takeaways from Hadoop Summit 2016, including a presentation on ETL use cases with Hortonworks and Syncsort.