Expert Interview (Part 1): Holden Karau on Her Latest Book and Upcoming Spark Developments
At the Strata Data Conference in New York City in the fall, Paige Roberts of Syncsort had a chance to speak with Holden Karau, who was then at IBM, and is now a Developer Advocate at Google. Ms. Karau is also a Spark committer and the author of Learning Spark. In the first of this two-part blog series, they discuss the release of Karau’s newest book from O’Reilly as well as some upcoming new developments in Spark.
Roberts: For our readers, let’s start with your name and what you do.
Karau: My name is Holden Karau, I’m a Software Development Engineer, and I actually have a new book that just came out.
Tell me about it.
It’s called “High Performance Spark,” and it’s from O’Reilly. The focus is on picking up where “Learning Spark” and a lot of other Spark books tend to leave off, which is where you know how to make a Spark job work but you don’t necessarily know how to scale your job, or how to tune your job, and really, how to do all the little tricks that you have to do to make things actually work. In a lot of introductory books, we prefer to talk about the simple things, and while the simple things work, …
Reality is never simple.
Yeah, reality always has some complications to it. It’s a great book, and I got much better royalties on it, so I think everyone should buy it.
[laughing] Definitely. Well, I have a question related to the last time we talked. You said one of the big problems with Spark was training your models and then deploying them. Since then, some cool tech has come out. What do you think about that?
There have been some interesting things that have come out on the serving side. There are some even more interesting things that may be coming out sometime later this year, or possibly early next year, but I can’t really say anything in detail on that. I can say that it’s being worked on by a variety of people at different companies who are working in a similar space together. I think that’s really exciting because on a lot of these projects, you see the serving layers come out of a specific company, and they’re very well-suited to that company’s needs. But since this is being developed by an assortment of people, I’m hoping it will generalize better.
Well, I look forward to talking to you about that when it comes out, then! For Spark model serving, what are people using now?
For model serving, MLeap is popular, but really what most people seem to do is use their own custom stuff, which is unfortunate. It’s actually not necessarily their own custom stuff, but they export it to whatever serving layer they have in-house. That might be custom, it might be open source, or it might be something else, but everyone essentially is doing their own export layer to something they know how to serve. That works, but it makes it …
Not easy to share.
Right, and it means there’s a lot of duplicated work, which as an open source developer makes me kind of sad.
Make sure to check out part 2, in which Roberts and Karau touch on what the future holds for Spark.