Expert Interview (Part 4): Databricks’ Jules Damji on the Advantages of Moving Big Data Processing to the Cloud
We’ve been sharing highlights of a long conversation between Syncsort’s Paige Roberts and Jules Damji (@2twitme), the Spark Community Evangelist for Databricks. In today’s final installment of the four-part series, we talk more specifically about Cloud platforms, the advantages of doing big data processing and data analysis work in the Cloud, and some specifics of the Databricks cloud.
Paige Roberts: You talked about the Databricks Cloud a bit. You guys have got a nice Spark Cloud capability that you provide. Is it like a SaaS (Software as a Service), is it a PaaS (Platform as a Service), what is it exactly?
Jules Damji: Well, it depends on how you think about it and how you would use it, but primarily SaaS. Say you’re a data scientist, and your company wants to create data science models and provide them as a service. In that respect, we are both SaaS and PaaS.
First, you’ll have to do some data ingestion and exploration. To run with this idea in Apache Spark, you’ll need clusters. You’d have to go to a data center or hire a hosting company, start getting machines, install software, find the right libraries, and so on. You’d have to create your own infrastructure.
Roberts: Hire some admins.
Damji: Yeah, you’ll need some admins. You don’t have all this money if you’re a startup company. And your core competency is data analysis and data engineering. All you care about is ingesting and exploring your data. You don’t want to worry about management. Apache Spark is great, but you don’t want to install it. You want the latest, and greatest, with all the fixes.
So, we provide Spark, along with a platform built around it with additional software artifacts, as a service running in the cloud.
You just want to use it.
You just want to use it. So what do you do? You either run your own on-premises cluster, or you come to Databricks, right?
You get an end-to-end, fully managed Cloud service with Apache Spark. You don’t need to worry about creating a cluster, you don’t need to know about managing it, you don’t need to know about tuning it. You don’t have to worry about monitoring it. You don’t have to worry about reliability, SLAs, and all that; it’s all taken care of.
You get strong SLAs from Amazon on EC2, and you get really beefed up machines. AWS has created all this competency around infrastructure that gives even large corporations confidence in them. S3 storage is very reliable. They design it for 99.999999999% durability, eleven nines.
That’s solid. And the Cloud saves a big chunk of time and money.
And then you can build things on top of that, right? You build your data science application, for example. The utility part, the stable infrastructure, is taken care of, the software is already installed for you.
A good analogy that I normally like to use is this: when Edison commercialized electricity, his partner asked, “What if we delivered this as an electricity grid?”
Cloud computing is heading that way, as a grid. I just come in, and I plug in my plug, and I’m guaranteed to get 110 volts if I am in North America. Or, I’m guaranteed to get 220 volts if I’m in Europe, right? When I plug in, I know I’m going to get that consistent service, and it will be reliable.
Take that analogy and say, “What if I’m able to do that with the Cloud?” I just go with one of the mega Cloud providers, and I’m guaranteed to get what I need. I’ll be able to scale, I’ll be able to get beefy machines, I’ll get the compute power. I’ll get the software components I need. I’ll get the storage power. I’ll get the reliability, I’ll get the bandwidth, I’ll get the throughput.
I get the security.
I get the security. All you must worry about is building things on top of that.
Edison’s partner said, “The people who are going to make money are not only the power utility companies that are gonna provide the power; that’s going to be a commodity. The people who are gonna make money are the people who are gonna build appliances on top of that.” Refrigerators, lamps, toasters, TVs, all those appliance manufacturers are making money. And they depend on the grid being already there.
Is Databricks the utility company in this analogy?
Databricks is not the utility company. We provide a service on top of the Cloud grid. So, I’m providing you a refrigerator you can store things in, or…
A range, and you can create dinner on top of the range. Okay. I get it. You create data science applications on Databricks, which is the appliance, and the Cloud company is the grid.
Right. The providers are going to be Amazon with AWS, Google, IBM, Microsoft with Azure, and now, lately, Oracle. They are the cloud utility companies.
Does Databricks support all of those?
Right now, we are only on Amazon. But the whole idea behind the data source API is the ability to get data from the myriad other places where you have it stored.
So, you can get data today from HBase. You can get it from MongoDB. You can get it from Redis. You can get it from all of these different storage systems. These data source APIs allow us to work with all those ecosystems.
Right. Well, thank you so much for talking to me, it’s been really cool.
Any time. I’m an evangelist. I’ve got to keep the lines of communication open.
If you missed any part of this four-part interview series with Damji, it’s not too late to catch up!
- Part 1: Apache Spark Community
- Part 2: Spark + Hadoop = Artificial Intelligence = Science Fiction Becoming Reality
- Part 3: Security, Cloud and Notebooks
For more talk about the future of Big Data, including more on Spark and the Cloud, read our eBook, Bringing Big Data to Life: What the Experts Say.