
Expert Interview: Rachel Warren Gives Some Valuable Tips on Tuning Spark Jobs

At last year’s Data Day event in Texas, Paige Roberts of Syncsort had a chance to speak with Rachel Warren, Senior Data Scientist at Salesforce. Warren discussed how she got involved in the tech field and offered some valuable advice for debugging and tuning Spark.

Roberts: First off, can you tell our readers a little about yourself?

Warren: My name is Rachel Warren. I am a software engineer/data scientist, currently working on Salesforce Einstein, and I’ve been doing some work with Spark. I worked together with Holden Karau, whom I met at a previous company, to write a book called “High Performance Spark,” which came out in May of last year. I’m based in San Francisco.

Are you working on anything now that’s really interesting?

Yes! I’m actually working for a team at Salesforce that’s doing automated machine learning with Spark for Salesforce Einstein, and continuing to think about the model serving problem. Basically, my experience working in the data field is that it’s pretty easy to develop a machine learning algorithm, or to use some existing things to build a proof of concept, but it’s really hard to put these kinds of insights into production. It’s pretty interesting work. And in my spare time, Holden and I are working on something that does automatic tuning of Spark settings. That’s something people spend a lot of time and resources doing.


Just turning the knobs, yeah. It’s one of the reasons we developed our Intelligent eXecution capabilities in DMX-h, to do some of that auto-tuning intelligently without people having to spend a bunch of time and resources on that. So, you’re working on a way for it to just be in a good place without people having to mess with it. That’s great. So, how did you get into this field?

I’m actually a fourth-generation tech. My great-grandfather ran an electronics distributor, my grandfather took over the business, then both my parents worked at Microsoft in the early days. So, I grew up in Seattle, in this tech world. It’s always been on my radar, but I always said I’d never do what my parents did, and be a computer programmer.

It’s clearly uncool if your dad does it.

Yeah, totally. But, I was in college, in liberal arts, and it didn’t feel tangible. I took a computer science class and it was really fun. It felt like this great, practical thing. It’s really satisfying, that feeling of building something that works. And the process of building it is also intellectually …


… rigorous. Yeah. It’s not the same as writing an essay. You get something out of the process that functions and is concrete. And it turns out it’s a pretty lucrative thing to know how to do.

[laughing] Yeah, it pays the bills. I saw your and Holden’s presentation on debugging Spark earlier today. Do you have any advice for people who are struggling with Spark? Can you share any good tips or tricks?

It’s not always the easiest system to use, so a lot of it is just attitude and patience. It’s important to remember that before Spark, if you were going to do the same kind of thing, you would have had to write MapReduce and like, hundreds of Java classes.

It used to be far worse than it is now.

Exactly. It’s actually really easy to get started with Spark. It’s a little hard to make it work really, really well. I think a lot of people get too deep into the tuning weeds. They write some code, then run it over and over again to tune the memory settings to get it to work. That’s really the wrong approach. You want to sit down and look at what you’re trying to do. Make sure you understand the API that you’re using and how it’s being evaluated, and whether you’re really thinking about data processing the right way.

Focus on your end goal and what you’re trying to accomplish, and if the code is really doing that.

Yeah. Absolutely. You can lose sight of that when you’re busy turning all the knobs. When it comes to data processing, you have a lot more flexibility than you might think. You could break stuff up into different tables if that’s going to make it process faster, or you could change the way it’s keyed, or … you just sometimes need to take a step back from the system.

Try a new method rather than trying to tune the same method over and over.


Was there anything you wanted to promote, anything you want me to plug a bit in the blog post?

Sure, buy the book, High Performance Spark.

That’s a good plug.

It’s a spark plug. [laughter] But seriously, the Spark web UI is really good.

Did you work on it?

No, I haven’t worked on it. But when you run your applications, it gives you a dashboard for real-time monitoring. You can really see what’s going on, what’s slow and what’s failing.

That’s always helpful.

It’s really worth the time to configure that.
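For readers who want to follow that advice, keeping the UI’s data around after a job finishes means enabling event logging so the history server can replay it. A minimal sketch of the usual `spark-defaults.conf` settings (the log directory path here is an assumption; use whatever shared storage your cluster has):

```
# Persist UI event data so finished jobs stay inspectable in the history server
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///spark-logs
spark.history.fs.logDirectory    hdfs:///spark-logs
```

With these set, the live web UI works as before, and completed applications remain browsable via the history server instead of disappearing when the driver exits.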

Thanks, that’s really useful information.

Make sure to check out our eBook, 6 Key Questions About Getting More from Your Mainframe with Big Data Technologies, and learn what you need to ask yourself to get around the challenges and reap the promised benefits of your next Big Data project.
