The major version release of Spark has been getting a lot of attention in the Big Data community. At the last Strata + Hadoop World in San Jose, Syncsort’s Big Data Product Manager, Paige Roberts, sat down with Reynold Xin (@rxin) of Databricks to get the details on the driving factors behind Spark 2.x and its newest features.
Reynold Xin is the Chief Architect for Spark core at Databricks and one of Spark’s founding fathers. He had just finished giving a presentation on the full history of Spark, from taking inspiration from mainframe databases to the cutting-edge features of Spark 2.x.
Paige Roberts: First, let’s go ahead and have you introduce yourself.
Reynold Xin: Okay, sounds good. My name is Reynold Xin. I’m one of the co-founders of Databricks. I’m also chief architect and have been leading the development of Spark with the company for the past couple of years.
Before Databricks, I was working on Spark as part of my graduate school studies at UC Berkeley. So, I’ve been working on Spark for a while.
Roberts: How long ago was that?
Xin: I started around 2011, so more than five years. I’ve been behind most of the major efforts of the past couple of years.
Roberts: We just came from your talk where you started the Spark history with IMS databases on mainframes, and you ended up at Spark 2.0. That was quite a journey.
Xin: Yes, absolutely.
Roberts: I don’t want to make you re-do your whole talk, so let’s focus on the main changes between Spark 1.6 and 2.0.
Xin: It’s a big version bump. It’s the first major release of Spark other than the initial Spark 1.0. We focused primarily on three aspects. One is performance optimization.
We started rolling out the DataFrame API in Spark 1.3, I think, which laid the foundation. Because it’s a higher-level API, we now have more room to do performance optimizations. If the user just gives us a specific function and we have to run it, there’s not much we can do.
With Spark 2.0 we were able to, in many cases, improve performance anywhere from 2x to around 100x, through a couple of projects, mostly Project Tungsten.
Roberts: Right. I did a blog post on Tungsten last year. Very cool project.
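Editor's note: Xin's point about higher-level APIs can be illustrated with a small toy sketch in plain Python. It does not use Spark and is not how Spark's optimizer or Project Tungsten works internally; the names `GreaterThan`, `Partition`, `run_opaque`, and `run_declarative` are hypothetical and invented for this illustration. The assumption is simply that each chunk of data carries a per-column maximum. When the filter is an opaque function, the engine has to call it on every row. When the filter is declared as data (a column and a bound), the engine can inspect it and skip chunks that cannot match.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]


@dataclass(frozen=True)
class GreaterThan:
    """Declarative predicate (hypothetical): the engine can see column and bound."""
    column: str
    bound: float

    def matches(self, row: Row) -> bool:
        return row[self.column] > self.bound


@dataclass
class Partition:
    """A chunk of rows plus assumed per-column maximum statistics."""
    max_by_column: Dict[str, float]
    rows: List[Row]


def run_opaque(partitions: List[Partition],
               func: Callable[[Row], bool]) -> Tuple[List[Row], int]:
    """The engine cannot look inside func, so it must read every row."""
    out, rows_read = [], 0
    for part in partitions:
        for row in part.rows:
            rows_read += 1
            if func(row):
                out.append(row)
    return out, rows_read


def run_declarative(partitions: List[Partition],
                    predicate: GreaterThan) -> Tuple[List[Row], int]:
    """The engine inspects the predicate and skips partitions that cannot match."""
    out, rows_read = [], 0
    for part in partitions:
        col_max = part.max_by_column.get(predicate.column)
        if col_max is not None and col_max <= predicate.bound:
            continue  # no row in this partition can exceed the bound
        for row in part.rows:
            rows_read += 1
            if predicate.matches(row):
                out.append(row)
    return out, rows_read


partitions = [
    Partition({"amount": 40.0}, [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 40.0}]),
    Partition({"amount": 900.0}, [{"id": 3, "amount": 900.0}, {"id": 4, "amount": 5.0}]),
]

opaque_result, opaque_reads = run_opaque(partitions, lambda r: r["amount"] > 100)
decl_result, decl_reads = run_declarative(partitions, GreaterThan("amount", 100))

print(opaque_result == decl_result)  # same answer either way
print(opaque_reads, decl_reads)      # the declarative path reads fewer rows
```

Both paths return the same rows, but only the declarative one gives the engine enough information to avoid work.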
Xin: So, that’s the first focus area, performance. The second one is Structured Streaming.
We have heard from a lot of our customers that new requirements are surfacing in building real-time continuous applications. These applications have to make decisions nonstop, usually on a live stream of data. And often these types of applications also have to be combined with batch applications.
We started thinking about how we can actually build a new streaming engine and APIs that are suitable for these kinds of applications. Our end result was actually pretty simple. We just took the DataFrame API – there’s no change to it – added a couple small extensions, and then the users could express that streaming computational logic exactly how they would express the batch part. So, that is the second part, Structured Streaming.
Roberts: Nice! Batch and streaming together, without having to learn a new API.
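Editor's note: the idea of writing the logic once and running it in both batch and streaming mode can be sketched with a toy example in plain Python. This is not Spark code and not how Structured Streaming is implemented. In Spark itself the batch and streaming entry points are `spark.read` and `spark.readStream`; here the hypothetical functions `count_by_action`, `run_batch`, and `run_streaming` stand in for a query and two ways of executing it, and the event data is made up.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List

Event = Dict[str, str]


def count_by_action(events: Iterable[Event]) -> Counter:
    """The 'query', written once: count events per action."""
    return Counter(event["action"] for event in events)


def run_batch(events: List[Event]) -> Dict[str, int]:
    """Batch execution: apply the query to the whole dataset at once."""
    return dict(count_by_action(events))


def run_streaming(micro_batches: Iterable[List[Event]]) -> Iterator[Dict[str, int]]:
    """Streaming execution: apply the same query to each new micro-batch
    and fold the result into running totals."""
    totals: Counter = Counter()
    for batch in micro_batches:
        totals.update(count_by_action(batch))
        yield dict(totals)


events = [
    {"user": "a", "action": "click"},
    {"user": "b", "action": "view"},
    {"user": "a", "action": "click"},
    {"user": "c", "action": "buy"},
]

batch_result = run_batch(events)

# Pretend the same events arrive two at a time.
micro_batches = [events[:2], events[2:]]
streaming_results = list(run_streaming(micro_batches))

print(batch_result)
print(streaming_results[-1] == batch_result)  # final running total matches batch
```

The query function is identical in both modes; only the execution wrapper around it changes.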
Xin: Last but not least, a lot more work is being done in SQL.
So, Spark 2.0 has become the most SQL 2003 standard-compliant open source Big Data query engine. We added window functions and subqueries, and 2.0 can run every single one of the 99 TPC-DS queries, which is a standard benchmark, without modifying the queries. As far as I know, none of the other engines can do that.
This makes it much easier for business analysts to run their existing work, and port their existing business applications over to Spark SQL.
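Editor's note: for readers unfamiliar with the two SQL features mentioned above, here is a small query that uses both a window function and a subquery. To keep it runnable without a cluster, it uses Python's built-in `sqlite3` module as a stand-in; it has not been run against Spark SQL, and the `sales` table and its data are invented for illustration. It assumes the bundled SQLite is version 3.25 or later, which is when SQLite gained window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, rep TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("east", "ann", 300),
        ("east", "bob", 280),
        ("east", "eve", 50),
        ("west", "cy", 500),
        ("west", "di", 100),
        ("west", "fay", 400),
    ],
)

query = """
SELECT region,
       rep,
       amount,
       -- window function: rank each rep within their own region
       RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rank_in_region
FROM sales
-- scalar subquery: keep only sales above the overall average
WHERE amount > (SELECT AVG(amount) FROM sales)
ORDER BY region, rank_in_region
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The subquery filters out below-average sales, and the window function then ranks the remaining reps inside each region without collapsing the rows the way a GROUP BY would.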
Roberts: So Databricks and the Spark community have put a lot of emphasis on usability and performance. That makes a lot of sense.
In part 2 of this interview, Reynold Xin gives us some good information on the differences between stream and Structured Streaming, how to integrate Structured Streaming with Apache Kafka, and some hints about the future of Spark.