
Expert Interview (Part 2): Holden Karau on the Future of Spark and the Open Source Community

At the Strata Data Conference in New York City in the fall, Paige Roberts of Syncsort had a chance to speak with Holden Karau, who was then at IBM, and is now a Developer Advocate at Google. Ms. Karau is also a Spark committer and the author of Learning Spark. In the first part of this interview, she discussed her newest book from O’Reilly, High Performance Spark. In this part, Roberts and Karau go more into what the future holds for the latest versions of Spark, and the open source community.


Roberts: What’s new that has come out since we talked last? I know we talked about Spark 2.0, but now we’re up to Spark 2.1.2.

Karau: Right now I’m running the Spark 2.1.2 release and if that closes soon, I will hopefully run the Spark 2.2.1 release. Then hopefully we’ll see Spark 2.3 at some point, not too far in the future. (Edit: Spark 2.3 now released!)

So, what are you excited about that’s coming in 2.3?

The thing I’m really excited about in Spark 2.3 is that it’s being worked on by multiple companies, which is always very exciting. Two Sigma, IBM, and some other folks have been involved in a project called Apache Arrow, which is really quite exciting, and it’s being integrated into Spark 2.3, which is also very exciting. I’m a Python person and a Scala person, and for someone who straddles those two worlds, Apache Arrow is the closest thing I get to magic. It’s a columnar in-memory data format that I can write to and manipulate from both sides. You can access it from the Java APIs, and you can access it from the Python APIs.

Initially in Spark, we don’t have the shared memory buffer, so I do copy and write support for this, but the really exciting thing is that, by integrating Arrow, we’re moving in the direction of being able to do maybe not zero copy, but closer to zero copy interoperability with things like Python. Nvidia is using it in a lot of the GPU acceleration stuff.

What you’re going to see in Spark 2.3 is, okay, it accelerates this one tiny thing, which doesn’t seem super exciting. The part that excites someone like me is that, as this goes forward, it can go to more and more places, and we can start using it more intelligently, and essentially stop paying this cost of living in the JVM. We need to be able to work with people who aren’t JVM developers, whether that’s PySpark developers or the fancy deep learning people.
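
To make the "columnar" and "zero copy" ideas above a little more concrete, here is a minimal sketch in plain Python. It does not use Apache Arrow or Spark at all; the standard-library `array` and `memoryview` types stand in for a shared columnar buffer, and the names `rows`, `ids`, and `prices` are made up for illustration. It only shows the general difference between sharing one buffer and copying it.

```python
from array import array

# Row-oriented data: one tuple per record (hypothetical sample values).
rows = [(1, 10.5), (2, 20.0), (3, 30.25)]

# Column-oriented layout: one contiguous, typed buffer per field.
ids = array("q", (r[0] for r in rows))      # 64-bit signed integers
prices = array("d", (r[1] for r in rows))   # 64-bit floats

# A memoryview exposes the same underlying bytes without copying them.
view = memoryview(prices)

# Building a new array from the old one copies every element.
copied = array("d", prices)

# Write through the original column buffer.
prices[0] = 99.0

view_sees_write = view[0] == 99.0     # the view shares memory with prices
copy_sees_write = copied[0] == 99.0   # the copy holds its own, older data

print(view_sees_write, copy_sees_write)  # prints: True False
```

The view reflects the write because it shares memory with the original column, while the copy does not. Every such copy costs time and memory, which is the kind of overhead a shared format between the JVM and Python is meant to reduce.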

That’s pretty neat.

It’s honestly the most exciting thing happening in Spark in my mind. And not even just Spark, it’s the fact that we’re all coming together and working together to share our toys.

It’s great to see that happening more and more. It seems like the big data world is pulling together more, and pulling in the same direction. I like seeing that, too.

Yeah, and that’s really exciting to see.

Syncsort added Spark 2 support, too. We’re in there doing our part.

Yay! Thank you for using Spark 2.

So, you just did your presentation. Before we go, what’s the most interesting part of your presentation that you want to share?

I think what’s fun about extending Spark ML is it’s essentially an excuse to go off, read a bunch of really cool research papers, screw around with some random tech, and see if we can do better than everyone else. Most of the time, we’re not going to win that dice roll, but I think some other times we are, and it’s going to be really amazing.

Historically, there’s been this big push to put models inside of Spark, and I understand that, but I’m hoping that by having this solid set of APIs, we’ll start to see more people able to just go and do cool stuff on their own, without having to have Spark people sign off on them doing the right thing. At the end of the day, there are way more smart people than there are Spark committers available to review these changes. I’m really excited that the standardization might open it up to allow more people to make really cool stuff happen.

The more people who roll the dice, the higher the probability that somebody is going to come up with something awesome. Thanks for taking the time to do this!

Thank you.
