In this Syncsort Expert Interview, Syncsort’s Paige Roberts speaks with scientist, writer, and author of numerous books on big data, Ellen Friedman. The two discuss how Hadoop fits in the industry, what other tools work well for big data streaming and batch processing, and about Friedman’s latest book in the “Practical Machine Learning” series, called “Streaming Data Architecture: New Designs Using Apache Kafka and MapR Streams”.
In Part 1, Ellen discussed what people are using Hadoop for, and where Spark and Drill fit in. In Part 2, she talks about Streaming Data — what she finds the most exciting about technologies and strategies for steaming, including cool things happening in the streaming data processor space, streaming architecture, metadata management for streaming data and streaming messaging systems.
Let’s talk about streaming data … What looks most exciting to you right now, as far as technologies and strategies?
I’m so excited about this topic, I really am.
People start off by saying, “I need real-time insight, I need an analytic tool, I need the right algorithm, and a processor that can handle streaming data.”
I think one of the tools that comes to mind first is Spark Streaming. It’s so popular, and it is a very good tool for working in memory. People say that they use it for real-time. It actually can’t do real-time… It approaches it by micro batching, which is very clever. And over people’s window of what’s required, it is often adequate.
People were going in before and adding Apache Storm, the pioneer in real-time processing. They’ve done a beautiful job. People do say, though, Storm is a little bit hard, a little crankier to use, but it’s worth the effort.
Right now, though, I’m so excited about a new project, a top-level Apache project called Apache Flink. Just as Spark started out of Berkeley, Flink started out of several universities in Europe, a university in Berlin, and in Stockholm. It’s a Ph.D. project, and was called Stratosphere. Then, when it came into the Apache Foundation, it became Flink. Flink does the same sorts of things as Spark Streaming, or Storm. The difference between Spark Streaming and Flink is that Flink is real-time. It isn’t approximately real-time, it is real-time.
They took Spark Streaming’s approach and turned it around. Instead of, “Go from batch and approximate real time by micro-batching,” they say, “Let’s do it the other way around. Let’s do good, clean street-accessible real-time streaming,” and you can have it go back toward batch by changing the time window. That’s a more efficient way to do it.
People say Flink is very developer friendly. It’s a tremendously growing community. It gives people another alternative, but one they can use for both batch and streaming processes. I think it’s great that people now have a selection of tools to look at.
For real-time insight with very low latency, those are great processing tools. There are others, Apache Apex, for instance. There’s a lot of tools. That’s not the only one, or two or three. Look at what fits your situation and your previous expertise the best.
There are some cool things happening in the streaming data processor space.
Definitely. But, let’s move upstream from that. If people are good and clever and safe, they realize that to deliver streaming data to a process like that, you don’t want to just throw it in there. You want some kind of a clue if there’s a rupture in the process. You don’t want to lose your data. So, you begin to look at the whole range of tools you can use to deliver data.
You can use Flume, or others, but the tool that we think is so powerful is Apache Kafka. It works differently than the others. And now, I’m additionally excited because MapR has developed a streaming app called MapR Streams, a messaging system feature that’s integrated into the whole MapR platform. It uses the Apache Kafka 0.9 API. They work very similarly. There are a few things you can do with Streams that Kafka wouldn’t be able to do for you, and the fact that it’s integrated into the whole platform. As I said, it simplifies things. But at the heart of it, they are approaching this the same way. I think MapR Streams and Apache Kafka are both excellent tools for this messaging.
But I want to talk to you about something more fundamental than the technologies, and that’s really what our book “Streaming Architectures” is about.
The architecture for streaming.
Exactly. Instead of just talking about tools, what I do in the book is to talk about what are the capabilities you want to look for in that messaging layer to support the kind of architecture that we think makes the best use of streaming data. Because right now, those two tools, Apache Kafka and MapR Streams, are absolutely the tools of choice. But people constantly develop new tools. So, it’s not about this tool or that tool. It’s about what does a tool need to do for you? Do you understand it’s capabilities and why they’re an advantage? If so, you’ll recognize other good new tools as they get developed.
So, what do you feel are the capabilities to look for in a good streaming messaging system?
I think the big idea is that it’s not just about using streaming data for a single workflow, a single data flow, toward that real-time insight. The value of the messaging layer technology and the value of that right architecture goes way beyond that. It’s much broader.
Kafka and MapR Streams are very scalable, very high throughput, without the performance being a tradeoff against latency. Usually, if you can do one, you can’t do the other. Well, Kafka and Streams both do them both very well. The level at which they perform is a little different, but they’re both off in a class almost by themselves. They also have to be reliable, obviously, but they’re both excellent at that.
Another feature to look for is that they need to be able to support multiple producers of data in that same stream or topic, and multiple consumers or consumer groups. They both can be partitioned, so that helps with load balancing and so forth.
The consumers subscribe to topics and so the data shows up and is available for immediate use, but it’s decoupled from the consumer. So these messaging systems provide the data for immediate use, yet the data is durable. It’s persistent. You don’t have to have the consumer running constantly. They don’t have to be coordinated. The consumer may be there and use the data immediately; the consumer may start up later; you may add a new consumer.
That decoupling is incredibly important.
It makes that message stream be re-playable. What that does for you is make that stream become a way to support micro services, which is hugely powerful. Both Kafka and MapR Streams have that feature.
Back to the idea of flexibility that we discussed earlier. These message systems work for batch processes as well as streaming processes. It’s no longer just a queue that’s upstream from a streaming application, it becomes the heart of a whole architecture where you put your organization together.
You can have multiple consumers coming in and saying, “Oh, you were streaming that data toward the streaming application because you needed to make a real-time dashboard, blah blah blah. But, look. The data in that event stream, is something I want to analyze differently for this database or for this search document.” You just tap into that data stream and use the data, and it doesn’t interfere with what’s going on over there. It’s absolutely a different way of doing things. It simplifies life. Both Kafka and MapR Streams support all of those features and architecture. And I think this is a shift that people are just beginning to relate to.
Shifting to a new way of thinking and building can be difficult.
One of the nice things about this decoupling and the flexibility of using a good messaging system is that it makes the transition easier, as well. You can start it in parallel and then take the original offline. A transition is never easy to do, but it makes it much less painful than it could be.
MapR has one aspect that’s different. It’s a very new feature. It goes one step further, and actually collects together a lot of topics that go up to thousands, hundreds of thousands, millions of topics into a structure feature that MapR calls the Stream. There isn’t an equivalent in Kafka. The Stream is a beautiful thing. It’s a management level piece of technology. You don’t need it for everything. But, if you have a lot of topics, this is a really great thing.
Kind of a metadata management for streaming data?
Well … for the topics that you want to manage in a similar way, you can set up multiple streams. There may be one topic in a stream, there may be 100,000 topics collected into that stream. But for all the ones that you want to manage together, you can set various policies at the Stream level. That makes it really convenient. You can set policies on access, for instance at the Stream level. You can set time-to-live.
People should get it out of their head that if you’re streaming data it’s going to be a “use it or lose it” kind of situation, that the data is gone because you don’t have room to store it. It doesn’t mean you have to store all your streaming data, it just means that if you want to, you have the option. You have a configurable time-to-live. The time-to-live can be…
Seven seconds or seven days …
Or, if you want to just keep it, you basically set that date to infinity and you’ve persisted everything. You have the option to do what you want, and you can go back and change it, too. You can set those policies. With MapR Streams, you don’t have to go in and set it
To 100,000 different topics.
Right. You can do it collectively. Say, that whole project, we want to keep for one year, and then we’re done. Or, that whole project, we want to keep it for one day, and we’re done. You can set access rights as well at the project level. You can set who in your organization has access to what data.
This group can access this project, this group can access that project, but they can’t access this other data.
That’s right. MapR Streams gives you that additional advantage.
And here’s another, different way of thinking about streaming architecture. MapR Streams has a really efficient geo-distributed replication. Say you’re trying to push streaming data out to data centers in multiple places around the world, you want that to happen right away, and you want to do it efficiently. You just replicate the stream to the other data center. It’s a very powerful capability, and
That is organized at the stream level, as well, so again, you might say, “These topics, I want the same time-to-live or I want the same access, but these I want to replicate to five data centers, and these ones I don’t. So, I’ll make two different streams.”
It’s a good management system. These are elegant additional features, but I think at the heart of it, even if you don’t have that capability, you still have the capability to bring a stream-first architecture to most systems. Then, streaming isn’t the specialized thing, it becomes the norm.
You pull data away from that into various kinds of streams, and decide, “I’m going to do a batch process here, a virtualized database there, and I’m going to do this thing in real time.”
Right now, Kafka and MapR Streams are the two messaging technologies that we like, but it doesn’t mean they will be the only ones in the field. That’s why I think it’s important for people to look at what the capability is, rather than just looking at the tools. A tool may be the popular tool now, but there may be even better ones later.
Is there anything else people should keep in mind relating to streaming architectures?
In this whole industry, looking at how people use anything related to the whole Hadoop ecosystem, I think future-proofing is something you need to keep in mind. People are very new to this, in some cases, and one thing they can do by jumping in and using any of these technologies is build up expertise. They’re not even sure, in some cases, exactly how they want to use it. The sooner they begin to build expertise internally, the better position they’ll be in by the time they want to put something into production.
On Friday, in Part 3, Ellen will talk about her book, “Streaming Data Architecture: New Designs Using Apache Kafka and MapR Streams”.