Expert Interview: Doug Cutting, Cloudera Chief Architect and Hadoop Co-Founder, Part Two
In Part 1 of this interview with Cloudera’s Chief Architect, Doug Cutting talked about how he got started in Big Data software, Cloudera’s role in recognizing the importance of Hadoop for businesses, what trends drove Hadoop’s growth, and what broad-based business successes Hadoop is now driving in Big Data.
In Part 2, Doug discusses with Syncsort’s Paige Roberts what he is working on now, the launch of Apache Spot, and how to help organizations stay on track with open source, both on-premises and in the cloud.
Syncsort’s own Paige Roberts sits down one-on-one for a candid discussion with open-source guru and Hadoop creator, Doug Cutting of Cloudera
Paige: What are you working on right now?
Doug: A number of different things. I spend a significant chunk of my time out on the road communicating with folks, trying to spell out this vision of where things are going. Probably close to a third of my time is spent doing that.
I also still do a little bit of development. I try to help out where needed in engineering and bringing people up to speed on things that I still may know better than other people. So, I’ve been doing some of that lately.
Also, I’m formally part of what we call the strategy office. There are three of us at Cloudera. We’re kind of a skunkworks in some ways. We’re trying to solve problems and set a pattern for how Cloudera should be solving problems.
So, one of the things we worked on recently was, how do we help non-profits? What’s Cloudera’s model going to be for how we can assist people that we think deserve access to data tools but probably can’t afford to pay us? So I worked with Thorn, who was a winner here last year, and came up with a pattern that I think we can repeat again and again with other non-profits for how we can provide them with assistance.
Paige: That’s awesome!
Doug: That was kind of a fun project. Another one you’re going to hear more about tomorrow [at Strata] is cyber security.
Paige: I heard about that. Apache Spot?
Doug: Yes, we’re launching this as Apache Spot. It’s a new project. And the exciting part of that for me is we’re trying to develop some open data models for cyber, for network data and data about users. You’ve got this sort of software stack, which everybody shares. But then each application ends up having its own schemas for the data, so they can have trouble sharing anything above the software layer. We can develop some common schemas for different industries and for different verticals, starting with cyber.
Paige: Some standards.
Doug: Some standards. Say, these are the formats. Say, if you’re going to put it in HBase, this is the way you ought to do it, and here are some tools to help you do that. So we can actually have some open source projects which implement these standards.
It’s not just a document. There’s actually some code that helps you glue things together. Then we can get a lot of different vendors with different kinds of applications. They can share the data, so you don’t have to have multiple copies of your network data for different purposes. So, I think we need to do this together. I think that, similarly, we need this in healthcare. We ought to have standard formats within the ecosystem. I mean there are some.
Paige: Yeah, like EDI. It’s a standard, but there’s like a million different ways to implement it.
Doug: But they’re also not formats that are friendly to the Hadoop ecosystem. So, how do we translate these into the Hadoop ecosystem, genetic data and so on? And there are some efforts in some of these areas already. But I think that’s a neat area for Cloudera to work on.
We don’t want to go into actually building vertical applications and vertical solutions. We want a platform. But we also need to help enable the platform to be effective in different verticals. So we’re starting to look at data formats that are specific to industries. I think it’s a good direction for us. That’s the part of the cyber thing that I particularly am interested in.
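Editor's note: the shared-schema idea Doug describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions. The `FlowRecord` fields, the two vendor formats, and every function name are hypothetical and are not taken from Apache Spot or ONI. The point is that once each source is normalized into one common record type, several applications can read the same data without keeping their own copies.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FlowRecord:
    """Hypothetical shared schema for one network flow (not Apache Spot's actual model)."""
    src_ip: str
    dst_ip: str
    dst_port: int
    bytes_sent: int


def from_vendor_a(row: Dict[str, str]) -> FlowRecord:
    # Vendor A (made up) uses short keys and string values.
    return FlowRecord(row["sip"], row["dip"], int(row["dport"]), int(row["bytes"]))


def from_vendor_b(line: str) -> FlowRecord:
    # Vendor B (made up) emits comma-separated text: source,destination,port,bytes.
    src, dst, port, size = line.split(",")
    return FlowRecord(src, dst, int(port), int(size))


def total_bytes_by_source(records: List[FlowRecord]) -> Dict[str, int]:
    # One "application": traffic volume per source address.
    totals: Dict[str, int] = {}
    for r in records:
        totals[r.src_ip] = totals.get(r.src_ip, 0) + r.bytes_sent
    return totals


def flows_to_port(records: List[FlowRecord], port: int) -> List[FlowRecord]:
    # A second "application" reading the same records, with no second copy of the data.
    return [r for r in records if r.dst_port == port]


# Both sources are normalized into the same record type before any analysis runs.
records = [
    from_vendor_a({"sip": "10.0.0.1", "dip": "10.0.0.9", "dport": "22", "bytes": "500"}),
    from_vendor_b("10.0.0.1,10.0.0.7,443,1500"),
]
```

Here `total_bytes_by_source(records)` and `flows_to_port(records, 22)` stand in for two different vendors' analytics working from one normalized data set.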
Paige: Can you tell me more about Apache Spot?
Doug: We’ve been working with Intel for a couple of years on this project – ONI [Open Network Insight] is what they were calling it. And this is just taking that into Apache. It’s been open source actually all along. Intel had it on GitHub. It has some data formats in it, but its primary focus has been on some analytics to help you identify threats, and that’s great stuff.
We want to keep working with Intel on developing that further. But we also want to really focus on getting a broader set of relevant schemas and data formats for data that can be used for other kinds of analytics in cyber and other kinds of predictions. Emphasizing that side of it more is what we’re hoping to do with Spot going forward.
The first thing we want to do is bring it to Apache so we have it some place that’s easy for lots of people to get involved in and collaborate. If it’s going to be a standard then Apache’s the right place to have it.
Paige: That makes sense. Yeah. I talked to a friend of mine, Ryan Merriman. He works as one of the architects on Apache Metron. I was wondering what’s the relationship between the two? How are they different? Is there a relationship?
Doug: Not really. Metron came out a while ago. I think they’re similar in a lot of ways. They came out at almost the same time – almost the same month. They’re sort of two parallel efforts.
We’ve been collaborating with Intel from the beginning on ONI, so that’s the one that we’re comfortable with. We have a set of six or eight partners that we’ve been working with who are using ONI already – building solutions for our joint customers.
Paige: Ah, you already have it in production, and it’s working.
Doug: It’s one of these cases where it’s unfortunate that there are two convenient things. On the other hand, it’s you know –
Paige: It’s open source. It happens.
Doug: And it could turn out to be a good thing in terms of the evolutionary context. You want to have some competition.
Paige: Well, is there anything exciting coming down the road that you want to talk about?
Doug: There are lots of exciting things coming down the pipeline that I have no idea about, I’m sure.
Paige: [laughing] Okay.
Doug: The other thing that Cloudera has been working a lot on – and you’ll be hearing a lot about it this week at Strata – is that we think people are really starting to move to the public cloud in a big way.
There are a lot of people staying in their on-premises data centers, but more and more we’re seeing people move to public cloud. We’re trying to see how we can make this open source Big Data ecosystem really work well in the cloud, too. And make that a first-class citizen, and figure out what we need to do to make it a very natural place. And make it easy for people to go back and forth. We’ve got a lot of announcements around that – making that really seamless.
So that’s an exciting thing, and I think it’s important. People love open source because they’re free from a lot of vendor lock-in. That’s where a lot of the attraction is. That’s one of the big reasons why people use open source. Yet, when they go use a cloud vendor like Amazon, they are immediately using the proprietary services that Amazon provides, and that no one else has, and they’re totally locked in to Amazon. They’re sort of destroying all that…
Paige: Open source goodness.
Doug: Yeah. They’re using open source software on Amazon. They’re not using it exclusively though. They’re also using a lot of Amazon’s services, which lock them in. So, we’re trying to help people stay on an open source stack for these high-level services, and be able to run them on Amazon and Azure and Google Cloud and also on-premises. So that’s it.
Paige: Well, that’s great. Thank you so much.
Doug: My pleasure.
To hear how organizations can stay on track with their digital transformation, join Cloudera, Dell EMC and Syncsort industry experts on Thursday, December 8, for the webinar “The Path to Digital Transformation,” where they will discuss why bigger data equates to bigger opportunities. They will address how best to begin a big data journey by taking control of all data, controlling costs, and identifying the first use case, so organizations can move forward with confidence to transform their business.