Expert Interview Series: IBM’s Holden Karau on Hadoop, ETL, Machine Learning and the Future of Spark
Anyone who has ever searched for a good book on Spark has seen Holden Karau’s name on the cover of some of the most practical O’Reilly books on the subject, like Learning Spark and High Performance Spark. Holden is a dedicated Spark and PySpark committer with a unique perspective on how Spark fits with the Hadoop ecosystem, why ETL and machine learning are where Spark shines, and what the newest version of Spark has in store for us all.
I first encountered Holden a few years back. She was helping to teach a basic Spark introduction class at Strata + Hadoop World in 2014. She worked for Databricks back then. These days, she does her coding with IBM in their Spark Lab in San Francisco. If you get a chance to catch one of Holden’s presentations, grab it. She has a unique style of humor, plain speaking, and deeply practical advice that you shouldn’t miss. I chatted with her a bit after her presentation at Data Day Texas, then caught up with her again at Strata + Hadoop in San Jose in March.
My own schedule has been crazy enough that I’ve barely had time to breathe, much less blog. Now, with Hadoop Summit just around the corner, I’m finally getting a chance to share some of what Holden and I chatted about. Since it’s been a few months since we talked, some of what she predicted about the future of Spark has already come true in the technical preview of Spark 2.0.
Here’s what we talked about:
What do you think of the Data Lake/Data Hub concept?
I think it’s really easy to store a bunch of stuff that’s pretty useless, and there’s occasionally a lot of push to store things without necessarily thinking through why we should be storing them. In some ways, it’s very useful because we don’t always know what we’re going to need in the future, but in my experience, people have often not done enough thinking about what formats they want to store something in, or what things they want to capture. So they are just doing raw dumps which are not very useful downstream.
I think it can be really cool to have a Data Hub. It can be great, and it can hold a lot of things, but it’s also really easy to just make a pile of garbage. You have to set your corporate vision right. If you just think you must have a Data Hub, you’re going to get a Data Hub, but it’s not going to be useful. You have to think through some things beforehand. This is especially true of the formats people are storing things in. Often, they really haven’t put enough thought into what format their stuff should be in, so they are storing it in ways which make it almost unusable in the future. They’re going to have to do full table scans, full parses of everything. That’s just … you’re setting yourself up for failure.
That’s a good point. A lot of data is great, but only if it’s accessible. What do you think about current Hadoop metadata management strategies?
Yeah, there are a few. I don’t think they’re great … yet. I don’t think there has been enough work on evolvable schemas, which tend to be what your actual data ends up looking like.
For example, my data from last year won’t have all the properties of today’s data, and I might have some properties of today’s data I no longer have next year, right? Certain data sources went away, you got sued, or sold the company, or whatever. Some things happen, and this data isn’t there, or we’ve added things. A lot of people have these really, sort of rigid things. That doesn’t work great.
Parquet, for example, can do this metadata merging on the read side. Evolvable schemas are the idea that we’re going to have fundamentally similar data with slightly different schemas evolving over time. It’s something that doesn’t have a lot of great tooling around it yet, even though it seems to be how most data looks over the longer term. Message formats change.
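The read-side merging Holden mentions (Spark exposes it for Parquet via its mergeSchema option) amounts to taking the union of the field sets across generations of data and projecting every record onto that union. Here is a minimal, dependency-free Python sketch of the idea; the record contents are hypothetical, not from any real feed:

```python
# Two generations of the "same" data with overlapping but different fields.
old_records = [{"user_id": 1, "email": "a@example.com"}]
new_records = [{"user_id": 2, "country": "CA"}]

def merged_schema(record_sets):
    """Union the field names across every generation of records."""
    fields = set()
    for records in record_sets:
        for record in records:
            fields.update(record)
    return sorted(fields)

def read_merged(record_sets, schema):
    """Project each record onto the merged schema, filling gaps with None."""
    for records in record_sets:
        for record in records:
            yield {field: record.get(field) for field in schema}

schema = merged_schema([old_records, new_records])
rows = list(read_merged([old_records, new_records], schema))
```

Every row comes back with the same shape, and fields a given generation never had simply read as nulls, which is roughly what a Parquet reader with merged schemas gives you.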
That’s an old problem.
It is. Some people have it right to some degree. And sometimes people solve it by making monstrosities. A pretty common solution to that is: I have my schema, and then I also have this miscellaneous bag. Then over time, your schema becomes useless and all those fields of junk are in a miscellaneous HashMap and you’re just like, “Ah cool, I guess I’ll go and look at some strings.”
Everything’s in a BLOB, cool. Good to go. [Laughter]
Yeah. So, I think that’s a design pattern to stay away from. You should be more careful than just sticking a HashMap of miscellaneous stuff at the end of your records. That’s what I think. But I’m very biased toward the random problems that I see, so my views are maybe not representative.
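The “miscellaneous bag” anti-pattern above, and the explicit schema migration Holden argues for instead, are easy to sketch in a few lines of Python. The field names and migration function here are hypothetical, purely to illustrate the contrast:

```python
# Anti-pattern: a typed schema plus a catch-all map. New fields land in
# "misc" as loosely typed values, so downstream consumers end up digging
# through a junk drawer of strings.
bad_record = {
    "user_id": 42,
    "misc": {"signup_source": "mobile", "plan": "pro"},  # untyped junk drawer
}

# Alternative: version the schema and write an explicit migration per
# version, so every consumer sees one well-defined current shape.
def migrate_v1_to_v2(record):
    """Promote fields out of the misc bag into first-class columns."""
    misc = record.get("misc", {})
    return {
        "schema_version": 2,
        "user_id": record["user_id"],
        "signup_source": misc.get("signup_source"),
        "plan": misc.get("plan", "free"),
    }

current = migrate_v1_to_v2(bad_record)
```

The migration is boring code, which is exactly Holden’s point: nobody wants to write it, but it is what keeps the schema meaningful over time.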
Actually, one of the problems we see a lot even with mainframe data is that over time the data has moved, so that it no longer matches the copybook. This is so not a new problem.
I really don’t understand why we still have this problem. We have so few tools for this. Probably because it’s boring to make these tools, it’s no fun.
They are not the cool kids.
Yeah, no. The cool kids don’t write schema migrations. [Laughter] Even though what you need is schema migrations.
Yeah, it’s true. Okay, let’s jump to an easy question for you. What do you think of Spark?
I’m obviously pretty biased, but I think Spark is awesome. I’m really excited about Spark 2.0, and where it’s heading. It looks like we’ll be able to get rid of some parts of the API which maybe haven’t been so well thought through. There are some things that I wish we could kill that I don’t think I’ll be able to convince people of, but I’m really excited about the direction. I think also, as a project, it’s evolving to a certain degree. Historically, in Spark, it has not been uncommon for people to create an issue, and then with no discussion just go ahead and write the code and do the change. And that’s okay when you’re like six people working together in the same lab but …
With a thousand committers you can’t get away with that.
Yeah. I think the project is really learning how to cope with these challenges. They’re not making hard and fast rules, but socially, the expectations are evolving. So I think Spark 2.0 will bring some good technological changes, but also some great cultural evolution.
Okay, you said you were really excited about Spark 2.0. What’s the thing about it that we should really be excited about?
In my view, datasets are pretty much the most exciting thing in Spark 2.0. It’s this ability to mix relational and functional transformations without having to do a lot of groundwork. In the future, datasets will, to a large extent, replace RDDs for representing your data. I think there are a lot of things that need to be improved in Spark SQL before that’s actually feasible, but that’s the direction people have chosen to go in. I think we’re also going to see a much more powerful SQL engine coming out of Spark, so that’s really good for everyone.
It sounds like it’s almost turning into, over time, kind of a giant database.
In some ways, yeah, but datasets still work for arbitrary code execution, so it’s not just for SQL queries. But certainly the optimizer starts to look more and more like a SQL optimizer, and less like a traditional graph optimizer. Things look more like compilers rather than like libraries and that’s quite interesting.
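The appeal of Datasets described above is that relational steps (which the optimizer can inspect) and functional steps (arbitrary code) chain together in one pipeline. A toy, dependency-free Python sketch of that mixing follows; the ToyDataset class and its data are invented for illustration and are not Spark’s API, and unlike real Datasets there is no optimizer behind it:

```python
# Conceptual sketch: a "dataset" that supports both relational-style
# operations (filter on a named column) and functional ones (map with an
# arbitrary function) in a single chain. In Spark, Catalyst can optimize
# the relational steps; this toy version just chains eagerly.
class ToyDataset:
    def __init__(self, rows):
        self.rows = list(rows)

    def where(self, column, predicate):
        """Relational-style step: filter on a named column."""
        return ToyDataset(r for r in self.rows if predicate(r[column]))

    def map(self, fn):
        """Functional step: apply arbitrary code to each row."""
        return ToyDataset(fn(r) for r in self.rows)

    def collect(self):
        return self.rows

people = ToyDataset([
    {"name": "ada", "age": 36},
    {"name": "bob", "age": 17},
])
adults = (people
          .where("age", lambda a: a >= 18)   # relational-style
          .map(lambda r: r["name"].upper())  # arbitrary functional code
          .collect())
```

The point of the sketch is the shape of the pipeline: declarative filters and opaque user functions interleave freely, which is what the Dataset API makes first-class.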
There is also a trend to move things out of Spark that don’t belong in Spark: some of the integrations with third-party data systems, for example. The ones that were in Spark itself were sort of randomly chosen; whoever happened to be around early on wanted to talk to those data sources, so those were the ones that were first party.
And now, since we’re refreshing our APIs anyway, we’re looking at which ones make sense to have tightly coupled to the release, and which ones should have a separate release cadence. I think that makes a lot of sense. Kafka is cool, but it’s not going to time its releases with Spark, right? So you don’t want these connectors to be closely tied to Spark; you want to be able to choose different connectors.
You want to be able to release the new version of Spark without worrying about when the next version of Kafka is going to come out.
Right. We’re all Apache projects. We’re all friends. But, dear God, no one wants to solve that coordination problem.
The QA alone would drive you bananas.
Although to be fair, the downside of this is that it becomes an open question who is QA’ing the connectors. So, it’ll be interesting to see it develop, and also interesting to see who actually develops the connectors that were previously first party. If they continue to be developed by the Spark committers, there are some interesting questions that have to be resolved. I think it’s a really good thing that these questions are being asked around the release window, when there’s a chance to actually change things from how we’ve been doing it previously.
And Spark 2.0 is a big release. It’s a full-on version change. Now is the time to do the big changes.
Yeah, so there’s talk of dropping Java 7 support.
You know what I mean? If you’re going to do that, you might as well break some other things too.
[Laughter] There you go.