In this Syncsort Expert Interview, Syncsort’s Paige Roberts speaks with Ellen Friedman, a scientist, writer, and author of numerous books on big data. The two discuss how Hadoop fits in the industry, what other tools work well for big data streaming and batch processing, and Friedman’s latest book.
Tell me about your book.
Well, I just did! [laughs] The content I was talking about [in Part 1 and Part 2 of this interview] is kind of the heart of the book. But, there is more. The book is called “Streaming Architecture: New Designs Using Apache Kafka and MapR Streams.” It’s a book that should work well both for people who are actually the ones technically building these systems, and for people who are not. That’s the approach we take with all six of our books.
High level for the business person, and then drill down into the code for the technical person?
Right. It helps the very technical implementer because it gives them a chance to think about the basics behind what they’re doing. They don’t always have the time to do that.
We talk about why people use streaming and give a number of use cases. We talk about the stream-based architecture that I just described to you and why the messaging system is very important and how it can be used differently.
The third chapter is all about microservices … what the concept of microservices is, why that’s useful, and why organizations that move to that style have seen a lot of success with it. You don’t have to do streaming, obviously, to set up microservices. Streaming is a new way to support microservices, and I think sometimes people are surprised to realize it does support them. We explain how.
The fourth chapter is called Apache Kafka, and we explain the power of Kafka, how it works, templates, some sample programs … Chapter five turns around and does the same thing with MapR Streams. Then we have a couple of chapters that just take specific use cases. One is an anomaly detection case. The book shows how to build it using a stream-based architecture, and why that could be an advantage to you.
The last use case … [laughing] I’m laughing because one of our figures has a little joke built into it, but it’s using the example of IoT (internet of things) data, looking at container shipping, just a mind-boggling scale of data to transport …
Ted Dunning used that as an example in his talk [at Strata + Hadoop World 2016].
Well, I was at Strata Singapore in December. I was on the 22nd floor of some building meeting with a customer, but I was distracted and looked out the window, and I could see the container port. A huge percentage of the world’s container shipping goes through there. I’ve written about it before, but I had never been there. Staring out the window, looking at the scale, the sheer number of ships … it’s like your brain melts. It’s just stunning. And then you think that all those containers can be covered with sensors that are reporting back. There are sensors on the ship. You can have an onboard cluster. You can stream data to that cluster. It can then stream data to a cluster at the port, which is maybe owned by the shipping company, so they’re tracking what’s happening with their stuff. They can send that data around the world.
Like to the port authority.
…who is interested not just in that one company at that one port. The port authority is interested in what’s happening in all the ports. That’s where the geo-distributed feature of MapR Streams comes in. Then the ship loads up its stuff, leaves, and chugs off to the next port. While it’s at sea, its onboard cluster is collecting data about what’s happened. I’m not saying everyone’s doing this right now. I’m saying it’s the potential of what we see happening. Meanwhile, the cluster the shipping company has in Tokyo can be, with MapR Streams replication, sending that data to Singapore before the ship ever gets there. So, now Singapore has an accurate record of what’s supposed to be coming in on the ship. The ship comes in and says, “This is what’s happened while we were at sea. Let me update you about what’s happened.” It’s this beautifully synchronized thing.
Pretty amazing. We live in interesting times.
I think we do. I just find that to be a mind-boggling example, even more so because … I could see the scale, see all those ships and all those containers. I just thought, “Oh, my God. What a huge job.” I tell people, “If you read the book, you have to look for that little Easter egg of an example.”
At the end of the book, we talk about how to migrate your system if you are interested in taking this style of approach with Apache Kafka and MapR Streams. It gives some pointers for how to do it. MapR has the rights to the book for something like 30 days, so they are giving it away and doing a book signing here as well. MapR has it available online for free download. I know there is a PDF, and I think they are also sending it as an e-book, which is a little easier to read. The other books published by O’Reilly are available as free PDFs at MapR.com, including the series called “Practical Machine Learning.” Two are set up as active e-books.