Syncsort’s Paige Roberts recently caught up with Jules Damji (@2twitme), the Spark Community Evangelist at Databricks, and they enjoyed a long conversation. In Part 3 of this four-part interview series, we’ll look more at the importance of security to Spark users, the overwhelming move of a lot of Big Data processing to the Cloud, and what the Databricks Platform brings to the table.
In case you missed it: in Part 1, we looked at the Apache Spark community, and in the second post, we covered how the Spark and Hadoop ecosystems are merging, which supports AI development.
Paige Roberts: So, we’ve talked a lot about the new single API for Spark, a single API for Datasets and DataFrames. I can build my application once; I can run it in streaming, I can run it in batch. It doesn’t even matter anymore. I can execute it on this engine now, and maybe next year, I can execute it on another engine, and I won’t have to rewrite it every time. You won’t have to rebuild if it uses the same API. That’s very similar to a Syncsort message; we’ve been calling it Intelligent Execution, or Design Once, Deploy Anywhere.
Someone asked at Reynold Xin’s talk, “What do you do when you go from RDD to DataFrames?” The answer was, “Well, you have to re-write.”
Damji: Yeah. We can’t quite do it that far back.
Roberts: Still, that’s a very exciting and appealing model for a lot of folks, designing jobs once and having them execute wherever without re-designing. One of the things I see that Spark has as a distinct advantage over everybody else is just the level of the APIs. They are so much easier to use, they are so much more robust. Even more so with version 2.x. That seems to broaden your community, and make it easier for the community to add to the Spark ecosystem.
Damji: It does make a huge difference in community support and participation.
So, one thing we haven’t touched on much is about the Databricks business model. How does it work?
That’s a good question. Hardly anyone has effectively cracked the code on how to monetize open source technology alone. Probably one of the few companies that a lot of newer companies model themselves on is Red Hat.
Red Hat had a model of saying, “We are going to take Linux, which is open source, and we are going to add proprietary and coveted enterprise features on top of it to make it available and suitable for an enterprise. Then we are going to charge for a subscription and provide support and services with it, since Linux is our core competency. We have the brilliant hackers who can write your device drivers and that sort of thing.”
We know it better than anyone else.
Exactly. We know it better than anyone, so one added value is that core competency. Another is enterprise-grade security, which you won’t usually get out of the box from open source or from downloading from the repo. Kafka is going the same way with Confluent, right?
So, I think that’s the trend: whoever provides the best experience for Apache Spark on their particular platform is going to win. Databricks provides the best Apache Spark platform, along with a Unified Analytics Platform that brings people, processes, and infrastructure (or platforms) together. We provide the unified workspace with notebooks, which data engineers and data scientists can collaborate on; we provide the best IO access for all your storage. We provide enterprise-grade security for both data at rest and data in motion. And we provide a fine-grained pool of serverless clusters.
As more and more data moves into the Cloud, people are increasingly worried about sensitive data and how to protect it. So, security comes as part of this augmented offering.
They are! A lot of our customers are banks, insurance companies, and they’re really concerned with information security.
Financial institutions are a good example, and we have customers in that vertical. Financial institutions are warming up to the fact that Cloud is the future, and a good alternative. We have the same vision. So, we provide this unified analytics platform powered by Apache Spark, with other capabilities around it that are Databricks specific. It gives you this comprehensive platform, which separates compute from storage, because we don’t tell you what storage to use.
Store it however you want.
Right. You can store it however you want. We’ll give you the ability to bring the data in quickly, process it fast, and write it back quickly. All these different aspects of Databricks bring tremendous value to our customers: security, fast IO access, core competency in Apache Spark, and the integrated workspace of notebooks.
The data scientists, ETL engineers, and business analysts can work collaboratively through the Databricks notebook platform. You bring the data in, you explore the data, you do your ETL, you write notebooks, you create pipelines. So, those are the added features for our customers that come on top of open source. But underneath, it’s powered by Apache Spark.
Finally, you also get the ability to productionize your jobs using our job scheduler, and to manage your entire infrastructure without having to worry about it.
And as long as you keep making Apache Spark better and better, and the community keeps jumping in and loving it, then you guys have got a good future.
Yes! If you try our Community Edition, you’ll actually see those benefits. If you start using our Professional Edition, you begin to see more. Every time we create a new release, we release it for our customers as well as the community. They get that instantaneously.
That’s about as fast as it gets.
Don’t miss the final post of this four-part conversation with Jules Damji, which features more about Spark and Databricks, and the advantages of Cloud data analysis.