Syncsort Product Manager, Paige Roberts, took advantage of office hours at the MapR booth at Strata + Hadoop World to pick Ted Dunning’s brain on a variety of Hadoop software related questions, from what he thinks of some of the streaming engines to the different trends in Hadoop across the globe.
Ted Dunning (@) is Chief Application Architect at MapR and the author of several books available from MapR, O’Reilly and Amazon such as Real-World Hadoop, Sharing Big Data Safely and Streaming Architectures.
For this first part of the interview, Paige dug into some of the differences between modern streaming data processing frameworks.
Paige: I recently read your book Streaming Architectures, and I wanted to learn more about your view of some of the streaming engines available right now. There’s a lot of new choices, Apache Flink, Apache Apex and Apache Beam in particular, plus Apache Storm is the old standby. I know Beam is more of an API approach.
Ted: It’s at a higher level.
Right. What do you think are the advantages of say, Flink versus Apex?
You know, it just isn’t clear yet. In terms of market, Apex has a little bit better recognition in the US, but not much. In customer surveys, in contrast, Flink has probably four times the name recognition today. But, Apex has a bit of an odor from having been a commercial offering for a couple years already.
Did they have a different name before?
That’s the company.
Right. That’s the company that developed the Apex software. They also have a large connector library. It’s called Malhar, Apex Malhar. They have some small differences in technical details. I have no idea what the practical impact of those differences is. The thing that sets Apex and Flink apart from the rest is they both have a very good snapshot capability, a check pointing capability. They’re based on the Chandy-Lamport algorithm. It allows a streaming snapshot to be taken of the computation without pausing the computation. So that’s pretty cool.
That is cool. Yeah.
One of the things Flink has demonstrated is how it can take multiple streaming snapshots at the same time in the same computation. You can have multiple checkpoints being taken. Checkpoints are the data that you store in order to resume the computation without any data loss.
So, if something crashes or fails, you’ve got a start over point.
Or, if you just want to start over at a certain point for de-bugging purposes. There’s two ways to do that under streaming computation, and one is to stop everything. Get everybody to stop. Get everybody to write the current state. That’s really bad. Because it takes tens of seconds sometimes.
And you get data backing up, and …
What’s worse is you may not actually have a consistent join state. So, it may look simpler, but it’s really, really bad.
Yeah. The other option is to put markers into the input streams. Let’s say we’re going to do a checkpoint. Everything is either before this or after this. As those markers propagate through every node in the computation, as soon as it sees markers on all its inputs, it checkpoints itself and admits the markers. And so the checkpoint process flows through the computation just like the data flows through the computation. That lets you do this without stopping anything. It’s still processing, data is still coming in.
And Flink does that? Or Apex?
Flink and Apex both do that. Both of these are essentially the same algorithm. Flink exposes these checkpoints as what they call “save-points,” which are externally accessible. So you can use those for post-mortems. Or, you can use them to start a clone of a running streaming process. Even to start a clone of a process on a different version of your software, for de-bugging purposes or Q&A, or just to be doing a transition of an algorithm to another known stop.
Does it give you intermediate data sets, so if you want to say, “Well, before I did this computation what did the data look like? Or, let’s do this computation differently from this point forward”?
Yes, you could do that. It also does checkpointing in an incremental way. Flink does. I’m not sure about Apex. By doing this in an incremental way, you don’t stop and then write all this data out. You only write out the things as they change. That gives you again an appearance of no delay at all for this checkpointing process.
It’s just awesome. And earlier systems just couldn’t do this. And this is all tied into the exactly once guarantees. Now you can’t really do exactly once in a distributed system because distributed systems are distributed and complicated. But, what you can do is you can set it up so that any data that gets written, even if it’s written more than once, it will write exactly the same value. So, it isn’t exactly once. It’s just that you can’t tell the difference. You get the same results.
The outcome is the same.
It’s kind of an interesting philosophical difference.
If a tree falls in the forest … If you do a calculation more than once, but you get the same results … [laughing]
Exactly. If I write the right answer twice, is that okay?
So, these new systems are really good. I mean, Storm was really good when it came out because we had nothing, nothing useful there. Spark Streaming made that a little better, but it’s micro-batching and it just isn’t suitable for a lot of applications. Now we have the latest generation of things which is true streaming. Of course, it’s easy to go from streaming to batch because a batch is just a set of finite streams.
Right. Chopped up streams.
Now you can do different optimizations in batch cases like Flink does. You use exactly the same APIs. That’s just very, very cool. So we’re seeing some important customers beginning to move into these new generation of streaming algorithms with Flink.
So, you’re seeing a lot of customers moving towards Flink?
In Europe. It’s a European project and there’s much better awareness there. We’re definitely seeing a lot of customers at least trialing Flink, and several of them are beginning to want to put it in production. Putting it in production is grated to some degree on whether or not a company like MapR supports it. That’s a chicken and egg thing, because we’re very customer driven.
So, your support is driven by the fact your customers want it, and your customers can’t use it until you support it. [laughing]
Yeah. We’ll probably make a decision on exactly which way we go in the next few weeks, but I think it’s important to support this kind of computing, at least by informal means. Certainly, right now, you can talk to data Artisans, you can talk to DataTorrent, and they can support it on MapR. The fastest Flink has ever run is on a MapR cluster. It’s screaming fast.
In the second part of this conversation to be posted tomorrow, Ted Dunning and Paige Roberts compare Storm to the more modern Flink and Apex. How are they different? When is Storm still the best option? In the third part, we’ll discuss the mechanics of good open source community building, and why that can be trickier than it sounds.
In part 2 of the three-part interview, Ted contrasts the advantages of new platforms like Flink versus the old streaming standby, Apache Storm… stay tuned.
For more information, get the DMX-h Quick Start Guide for MapR VM.