Expert Interview, Part 3: Ted Dunning of MapR on Open Source Community Building and Apache Calcite and Arrow

In the first part of this interview, Ted Dunning, Chief Application Architect at MapR, delved into the differences between streaming data processing engines Apache Flink and Apache Apex. In the second part, he discussed how Apache Storm differs from Flink and Apex.

In this final section of the interview, Syncsort’s Big Data Product Manager, Paige Roberts, learns the secret to building a solid open source community, and the origin of some interesting SQL-on-Hadoop-related Apache projects.

Paige: Are you seeing community support for Flink picking up?

Ted: Yes, absolutely. The community is growing quickly. Apex is growing probably a little bit less quickly, and that’s an unfortunate thing. Because Apex came from a commercial venture, that makes it hard to build community. People tend to have this preconception that those guys are just going to do it. That’s their job. It’s tricky to avoid the impression that …

That anything you contribute is maybe going to be ignored because you don’t work for the originating company?

Yes. Very often these perceptions are just dead wrong, but it’s hard to rewrite them in a community’s head.

For example, at MapR, we definitely pushed Apache Drill at the beginning, and because we have a line item on our list which says Apache Drill distributed, people think of that as a product.

They think of it as yours.

Yes. But we think of too much involvement in Drill as a bug, not a feature. We have intellectual property, we have technology that nobody else has, and we call that proprietary. We built it, invented it, and we sell it. But when we want to make things open, we want to spread the platform, we want to make things more interoperable, we want it really open. We make the choice and we go to the limit, either way.

We don’t go halfway with open source, closed community, whatever. But it’s very hard for our competitors, who don’t have that choice, to say, “Oh, yeah, we want that too.” They’re worried about not controlling the project and things like that. That’s bad for the platform.

So one of the workarounds we’ve done at MapR is that we helped spawn Apache Calcite, for instance. That was developed from the Drill SQL parser and optimizer, which came from another open source project called Optiq. But there is no product called Calcite. Different companies can cooperate on it, because the marketing department just doesn’t care.

It’s not a product.

The techs care, but they want the best technology irrespective of where it goes. Calcite is now the parser under Phoenix, under Kylin, under Hive. They’ve talked about putting it under Impala. Essentially, every new project at Apache that’s using SQL parsing is using Calcite. From our point of view, that’s a huge win!

We helped build this community, and we helped build interoperability across all these tools. And we support multiple SQL-on-Hadoop tools: Hive, Drill, and Impala. We have close relations with the Kylin community. We have close relations with the Phoenix community, and so on. But the point is that we now have interoperability, and that’s going to build a broader base that benefits everybody. That’s a win.

Yeah, it is.

We’ve recently helped do the same with Apache Drill’s Value Vectors, as well. That’s been spun out as Apache Arrow. Now 13 major open source projects have committers involved in Arrow, including Cassandra, Spark, Python, Kudu, Drill, Parquet and others.

The Hive community has expressed interest lately, but they’re not in yet. What that gives us is a very strong interoperability between these components. They can share data in memory at very high speeds.

That’s awesome.

It is awesome.

That’s the first time I’ve heard of that.

Well, there’s a reason for that. We don’t want anybody to claim ownership of that project. We want it to be a technology project at Apache that benefits everybody. We’ve made sure these are separate projects that will never be on anybody’s price list. So we can do real Apache community building stuff. And that’s kind of cool.

That is very cool.

We want to support these projects and help them out. We want to …

Contribute. Be part of the community.

Right. I love when that happens. I really don’t like it when 90 to 95 percent of a project’s management works for one company. When that happens, it’s a failure of open source. It’s not a win. It’s not a way to build community. I’ve been involved in open source now for four decades. These things should outlive any given company.

Syncsort’s made some major contributions to Hive and MapReduce, among others.

We’ve got a lot of customers using the software, and they say, “Well, this doesn’t work in Hive and it’s causing us all kinds of problems” and we say, “We can fix that.” But because we don’t have committers, it takes six months or a year or more to get a fix made.

Right. I am a committer and PMC member on ZooKeeper, and there was a problem with leap seconds. I found a fix for it, published the patch, and we had some discussion. But it took three years of pushing to get it in. You know what got it in? It was the next leap second coming up.

Oh, that’s funny. [Laughter] It took them three years, and then it’s like, “Oh, deadline’s coming up.”

And I’m not going to be able to negotiate this deadline. This leap second will happen. It’s good though, that the fixes go in. It doesn’t matter who gets the credit as long as it gets taken care of.

It’s good for the community. Well, thank you for the talk.

No problem. Hope I didn’t talk too much. I tend to get a bit candid.

It’s refreshing.
