Back in March at Strata + Hadoop World, Syncsort’s Paige Roberts caught up with Jules Damji (@2twitme), the Spark Community Evangelist for Databricks, and had a long conversation. In today’s first part of this four-part series, we look at one of the important themes that came up again and again: the importance of the Apache Spark community to the continued development of Apache Spark, including the depth of the Databricks relationship with the Apache Spark community and how that affects the development of Spark over time.
Over the next few days, we’ll also talk about how Spark is merging with the Hadoop ecosystem and supporting AI development, as well as Spark security and the big move of Big Data processing to the Cloud.
Paige Roberts: Let’s start by having you introduce yourself.
Jules Damji: As the Spark Community Evangelist at Databricks, my job is to reach out to the developers of Apache Spark and evangelize both the benefits of Apache Spark and of the Databricks Unified Analytics Platform.
More importantly, my job is to listen to the community: it’s a two-way conversation. I think that’s an important part of any advocacy. If you really want to win the hearts of developers, you must understand what their needs are. That’s the crucial link I provide between the community and the Spark engineering group within Databricks. It’s an important role and I just love it.
Roberts: For three years, I had the title of Hadoop and Analytics Evangelist so I totally understand the role. How do you perceive your relationship to the Apache Spark community?
Damji: When Databricks was founded, we realized that Apache Spark was this new big thing. We wanted to give it to the community. We wanted to make sure that the community could contribute to it. We believed from early on that innovation happens in collaboration, not in isolation.
So even though we produce and contribute a lot of the code, we also take a lot of the community’s code, and one of the ways we do that is through evangelism. I go out to meet-ups. I listen a lot to find out what the community wants, bring back those requests from the developers and community, and give that feedback to the Databricks engineering team. Then Databricks and the community create pull requests on GitHub to build those new features.
I talked a little with Reynold Xin, the Chief Architect at Databricks, about the cool new technology in Spark 2.x and about Structured Streaming. What does Databricks as a company bring to the conversation with the community?
So, what Databricks as a company brings to the community is stewardship: ascertaining, with the help of the community, what new features are needed, and ensuring those features get in through the release cycles managed by the PMC release managers. Like today, we attended the Structured Streaming talk, and a lot of things came up that we hadn’t thought about. We hear from our customers, we hear from our vendors, we hear from the community, and we hear from developers who are working with other customers who are building these applications.
Can you give me an example of a feature like that?
Okay. For example, watermarking is a feature that came about from customer need. What do you do with events after you drop them? In certain fraud detection scenarios, we want to keep those events. Suppose an event arrives more than 10 minutes late: what we do right now is just drop it. But there might be a need for auditing, for example, in the government sector.
What if there was a dispute of some sort? If you don’t have that particular record, because you dropped it, there’s no way to reclaim it. There’s no way to investigate. So, having those watermark events is one example of where the community raises the issue and we have the ‘Aha’ moment and we say, “Yes, we should implement that.” That’s a big thing that Databricks does. Databricks brings leadership and stewardship to the Apache Spark community.
What else does Databricks bring to the table?
Well, if you run our Databricks Unified Analytics Platform, we have the core competency to make your Spark experience the best, in more ways than one. And obviously, there are some additional technologies and benefits that come with the Databricks platform, added on top of Apache Spark, that make the Unified Analytics Platform the best place to run Apache Spark. But at the core, it will always be powered and anchored by Apache Spark, and it will always work in conjunction with the community. The whole premise of Databricks from the beginning has been that community is vital to the vibrancy of any technology.
Don’t miss the next part of this conversation tomorrow in Part 2! We’ll discuss the trend of the Spark and Hadoop technologies and communities merging over time, and how that’s creating a science fiction novel kind of world, where artificial intelligence is becoming commonplace.
Download Syncsort’s white paper, “Accessing and Integrating Mainframe Application Data with Hadoop and Spark,” to learn about the architecture and technical capabilities that make Syncsort DMX-h the best solution for accessing the most complex application data from mainframes and integrating that data using Hadoop.