This post was originally published on the “Big Data Page by Paige” blog.
In June at Hadoop Summit, I caught up with some friends from the project I worked on last year with Hortonworks, including Ryan Merriman who is now an Apache Metron architect. Since Apache Metron was a project I knew virtually nothing about beforehand, I quizzed Ryan about it. The conversation evolved into a discussion of the merits of Storm versus Flink and Heron, something I’ve been meaning to delve into for months here.
As I mentioned in my previous post, I’ve done some interviews with some interesting people at the last couple of big data events I attended. My interviews with Spark expert Holden Karau of IBM, Steve Sarsfield of HPE Vertica and Ellen Friedman of MapR are live on the Syncsort blog. My interview with Hortonworks CTO Scott Gnau just went live today.
Ryan Merriman was the guy with the amazing good sense to recommend hiring me for the Hortonworks contract job I did last year. He also has the good sense to live in Austin. We spent a lot of time travelling to the project site and back every week for months, and got to be friends. Now, he’s an architect on the Apache Metron project.
He sat down with me at Hadoop Summit and talked shop for a bit.
Paige: So talk to me about Apache Metron.
Ryan: Okay. I’ll just give you the end to end architecture.
Paige: Sounds like a good place to start.
Ryan: So step one, we have these various sensors that pump data into Kafka in different topics. The next step is a Storm topology that takes raw data and parses it into JSON with guarantees that there’s like seven or eight fields that will be there, that we use downstream. Once it’s parsed, the JSON gets written back out to Kafka, and they all feed into one big Storm enrichment topology. So, the idea there is that through configuration, you can enrich certain fields. An example of that would be enriching an IP address with location information. Another example would be that you could take a domain and get information about who it belongs to.
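To make the parse-then-enrich flow Ryan describes a little more concrete, here is a minimal Python sketch. It is not Metron code: the field names, the three-column log format, the in-memory lookup table and the function names are all assumptions made up for illustration, and a plain dict stands in for whatever store actually holds the enrichment data.

```python
import json

# Hypothetical list of fields every parsed message is guaranteed to carry.
REQUIRED_FIELDS = ("timestamp", "source_type", "ip_src_addr", "ip_dst_addr", "original_string")


def parse_raw(line, source_type):
    """Parse a made-up 'epoch_seconds src_ip dst_ip' log line into a normalized dict."""
    parts = line.split()
    if len(parts) != 3:
        raise ValueError(f"unparseable {source_type} line: {line!r}")
    message = {
        "timestamp": int(parts[0]),
        "source_type": source_type,
        "ip_src_addr": parts[1],
        "ip_dst_addr": parts[2],
        "original_string": line,
    }
    # Contract check: downstream steps rely on these fields being present.
    missing = [name for name in REQUIRED_FIELDS if name not in message]
    if missing:
        raise ValueError(f"parser broke its contract, missing: {missing}")
    return message


# Stand-in for an external lookup table (assumed data, not a real geo database).
GEO_BY_IP = {"10.0.0.5": {"country": "US", "city": "Austin"}}

# Configuration decides which fields get which enrichment.
ENRICHMENT_CONFIG = {
    "ip_src_addr": {"geo": GEO_BY_IP},
    "ip_dst_addr": {"geo": GEO_BY_IP},
}


def enrich(message, config):
    """Return a copy of the message with one extra key per successful lookup."""
    enriched = dict(message)
    for field, lookups in config.items():
        for name, table in lookups.items():
            hit = table.get(message.get(field))
            if hit is not None:
                enriched[f"{field}.{name}"] = hit
    return enriched


parsed = parse_raw("1467331200 10.0.0.5 192.168.1.9", "demo_sensor")
print(json.dumps(enrich(parsed, ENRICHMENT_CONFIG), sort_keys=True))
```

The point of the sketch is the shape: the parser normalizes every sensor into the same small set of fields, so the enrichment step can be driven purely by configuration.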
So where do you get that information? Is it stored in a lookup?
There are several different ways to do it. Most common is for that information to reside in HBase. It’s like a streaming pointer.
You do a quick query?
Essentially. Of course, it’s all backed up. I mean I’m not getting into the implementation details, but every enrichment is backed up by a local cache to make it fast. The enrichments are the first step. They all happen in parallel. So it’s like a streaming split and join.
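The "split and join" idea with a local cache in front of each lookup can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lookup functions, their data and the `backing_store_reads` list are invented here, a thread pool stands in for the parallel branches of a streaming topology, and `functools.lru_cache` stands in for the local cache.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

# Records every time a lookup misses the cache and hits the (pretend) backing store.
backing_store_reads = []

# Invented sample data standing in for the real lookup tables.
_GEO = {"10.0.0.5": "Austin, US"}
_WHOIS = {"example.com": "Example Registrant"}


@lru_cache(maxsize=10_000)
def geo_lookup(ip):
    backing_store_reads.append(("geo", ip))
    return _GEO.get(ip, "unknown")


@lru_cache(maxsize=10_000)
def whois_lookup(domain):
    backing_store_reads.append(("whois", domain))
    return _WHOIS.get(domain, "unknown")


# enrichment name -> (message field it reads, lookup function)
ENRICHERS = {
    "geo": ("ip_src_addr", geo_lookup),
    "whois": ("domain", whois_lookup),
}


def split_and_join(message, enrichers):
    """Fan the message out to every enricher in parallel, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(enrichers)) as pool:
        # Split: one task per enrichment.
        futures = {
            name: pool.submit(fn, message[field])
            for name, (field, fn) in enrichers.items()
        }
        # Join: wait for every branch and fold its result into one message.
        joined = dict(message)
        for name, future in futures.items():
            joined[name] = future.result()
    return joined


msg = {"ip_src_addr": "10.0.0.5", "domain": "example.com"}
first = split_and_join(msg, ENRICHERS)
second = split_and_join(msg, ENRICHERS)  # served entirely from the local cache
```

After the second call, `backing_store_reads` still holds only two entries, one per enrichment, which is the effect the cache is there to produce.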
Then the next step of the process is threat intelligence. So, there are several different feeds and sources of information about malicious IPs. This step works almost exactly the same as enrichments where you’re giving it an IP address. Now, it’ll check against the threat intel list.
And say, “Oh, that? No. Not that IP.”
Yeah, exactly. The last step of the process is it gets indexed into a search engine. So that could be either Elasticsearch or Solr, or both.
We’ve got a Kibana front end on Elasticsearch right now. We’re in the process of building UIs: a custom UI and a configuration UI. Eventually we’ll build the UI that will replace Kibana. There’s also PCAP support so you can capture raw PCAP.
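The threat intel step can be pictured as the same kind of lookup, except the answer is a flag rather than extra detail. The Python sketch below assumes a feed made of IP addresses and CIDR ranges; the feed contents, the field names and the `is_alert` and `threat_hits` keys are hypothetical, not Metron's actual schema.

```python
import ipaddress


def load_threat_feed(entries):
    """Turn feed entries (single IPs or CIDR ranges) into network objects."""
    return [ipaddress.ip_network(entry, strict=False) for entry in entries]


def triage(message, networks, ip_fields=("ip_src_addr", "ip_dst_addr")):
    """Return a copy of the message flagged if any IP field matches the feed."""
    hits = []
    for field in ip_fields:
        value = message.get(field)
        if value is None:
            continue
        address = ipaddress.ip_address(value)
        if any(address in network for network in networks):
            hits.append(field)
    flagged = dict(message)
    flagged["is_alert"] = bool(hits)
    flagged["threat_hits"] = hits
    return flagged


# Documentation-only address ranges used as a pretend feed.
feed = load_threat_feed(["203.0.113.0/24", "198.51.100.7"])
suspicious = triage({"ip_src_addr": "203.0.113.9", "ip_dst_addr": "10.0.0.1"}, feed)
clean = triage({"ip_src_addr": "10.0.0.2", "ip_dst_addr": "10.0.0.1"}, feed)
```

A flagged message would then go on to the indexing step like any other, just with the alert fields attached.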
I don’t know that acronym.
Packet capture. It’s the raw data that is transmitted over the network.
Yup. So we store that in HDFS and have a little query interface where you can replay the packets.
Where you can check trends and patterns?
Yup. That’s pretty much it now. That’s the current state.
So that’s all IoT source data coming in, sensor data, and telemetry data from your data center, things like that.
Yes, but specifically for security. Metron isn’t a general IoT streaming data processor. It processes IoT data specific to the cyber security domain.
So you’re not going to look at any other kind of data movement or transaction?
Well, eventually. This is step one. This is what OpenSOC did. This is feature parity with OpenSOC. The next phase is going to be data science. We’re going to facilitate building models, detecting patterns, you know, predictive analytics, that type of stuff.
Checking for a normal pattern so you can recognize an abnormal pattern, that kind of thing?
Yup. Because right now it’s all rule based and a lot of these sensors are rules. If this, then this. It does not scale.
And it changes so fast.
So in order to react quickly, you need to move to machine learning. And then eventually down the road, people can share models with each other and it will be like a model marketplace where people can reuse.
That would be really cool.
That’s way off though. That’s… Who knows?
I just interviewed Holden Karau at Strata, and she was saying that one of the shortcomings of Spark is that you can train a model really well, it’s really good at that, but then you can’t easily export that model to use it anywhere else.
Yes, serializing models, that’s a difficult problem. We still have to figure that out. MaaS may help with that.
MaaS? We’re not talking about Metal as a Service, are we?
No, Model as a Service, actually. Casey Stella is the Metron Analytics lead, working with MaaS. It’s a custom YARN application focused around deploying model scoring services, and discovering deployed model scoring services with ZooKeeper. This allows you to create and train a model, have multiple instances of that model deployed, and have separate components find where those models are being served in your Hadoop cluster and use them. The current architecture assumes your models are exposing RESTful endpoints.
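The discovery pattern described here, several instances of a model registering themselves so that other components can find and call them, can be sketched without any cluster at all. In the Python below, an in-memory registry stands in for the discovery data that the interview says lives in ZooKeeper, the endpoint URLs are invented, and the HTTP call is passed in as a plain function so that nothing goes over the network. None of the names come from Metron itself.

```python
class ModelRegistry:
    """In-memory stand-in for a service-discovery store."""

    def __init__(self):
        self._endpoints = {}  # model name -> list of endpoint URLs
        self._next = {}       # model name -> round-robin counter

    def register(self, model, url):
        """Record one running instance of a model scoring service."""
        self._endpoints.setdefault(model, []).append(url)
        self._next.setdefault(model, 0)

    def discover(self, model):
        """Return the next endpoint for a model, rotating across instances."""
        urls = self._endpoints.get(model)
        if not urls:
            raise LookupError(f"no deployed instances of model {model!r}")
        index = self._next[model] % len(urls)
        self._next[model] += 1
        return urls[index]


def score(registry, model, features, post):
    """Find an instance of the model and send it the features to score.

    `post` is any callable taking (url, payload); in a real deployment it
    would be an HTTP POST to the model's REST endpoint.
    """
    return post(registry.discover(model), features)


# Hypothetical deployment: two instances of the same model.
demo_registry = ModelRegistry()
demo_registry.register("dga_detector", "http://node1.example:8080/score")
demo_registry.register("dga_detector", "http://node2.example:8080/score")


def fake_post(url, payload):
    # Pretend scoring service: echoes which instance answered.
    return {"served_by": url, "is_malicious": payload["domain"].endswith(".bad")}


result = score(demo_registry, "dga_detector", {"domain": "evil.bad"}, fake_post)
```

Because callers only ask the registry for "an instance of this model", instances can be added or removed without the callers changing, which is the appeal of the approach.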
Cool. That sounds exactly like what is needed. Is MaaS part of Metron?
Yes, MaaS is a sub-component of the Metron Analytics component. Metron is essentially an IoT streaming application for all aspects of cyber security.
So is it going to just drop right into HDP? Is it going to be part of the stack?
Well, the idea is for it to be agnostic about which stack you use.
Okay. So you could drop it into Cloudera or MapR?
Yup, eventually. Or just native straight up Apache Hadoop. You just need to have the right pieces. You’ve got to have Storm. You’ve got to have Kafka. You’ve got to have Elasticsearch. You’ve got to have ZooKeeper.
So, you’re using Storm for your processor. What made you guys pick Storm?
That’s funny. Somebody asked that question. We just gave a talk. Somebody else asked that exact question.
So, why did you choose it over Flink or Heron?
Everybody keeps asking about Flink. Well, the legacy OpenSOC, it was built in Storm. So our legacy code was already working. That’s a pretty compelling reason to use Storm, right?
What? You don’t want to have to rewrite everything from scratch? [Laughter]
We do have some requirements around complex event processing in a join. Which is not something simple. So we have two requirements. It’s fairly complex and it’s got to be fast.
Super-fast. As fast as possible.
You need really real real-time. I did a blog post called the Four Meanings of Real-Time. There’s sub-second response time, real, real-time. And then there’s all these other things that people mean when they say real-time.
Not just real-time, real real-time. [laughter] Also, as far as the non-functional requirements, our team has a lot of experience in Storm. We have no experience in Flink.
Yeah, that helps.
Hortonworks doesn’t support Flink, yet. That’s another thing.
Capital One, I think, are the poster child for Flink. I said something in a presentation about Flink. “No one is using it in practice yet. Flink is a science experiment so far.” And Asokan, our chief architect said, “Well, Capital One is using it.” They were the first I heard about.
So what’s interesting is Capital One is using Metron and Storm.
They may be using Flink for something else, but they’re not using it for this use case. So it’s probably just because we haven’t built it in Flink. I don’t know.
Well, they’ve got a big stack. They’re using a lot of tech, and doing a lot of really cool stuff.
They’re huge. Yeah.
They’re presenting on two or three different things here at the conference.
I’m not opposed to using Flink. It’s just not a priority because what we have works.
I asked on Twitter a while back, around the time of Strata, “Why did Apache NiFi decide to support Storm instead of Flink or Heron?” Holger Mueller, who is an industry analyst, retweeted the question with, “Yeah, I wonder, too.”
Everybody I’ve talked to who uses Storm was frustrated with it, and Heron was invented to fix the problems that caused that frustration. Tim Hall, who is a VP of product management at Hortonworks, answered me on Twitter. He said, “Heron is open source, but not yet open community via ASF. Many Heron improvements are already in Storm.” And about Flink, he said, “We already support Apache Storm. Apache Flink is emerging and interesting.”
That’s what I’ve heard is that people aren’t sure if Flink is as hardened as Storm is.
It’s so young.
Well, and the reason Spark is awesome is because close to a thousand people contribute to it.
Exactly. It’s got a ton of momentum.
Flink could become just as awesome, but only if it gets that kind of community support.
Storm works fine for me. I have no problem with Storm. It works great.
I think it had some scaling issues early on. Heron basically came out because it was Twitter’s way of fixing the things that were wrong with Storm, and everybody just went, “Why don’t we just put that in Storm?”
Why didn’t they fix Storm?
Yeah. [Laughter] That was pretty much the opinion. So I gather that they took a lot of the cool stuff out of Heron and put it in Storm.
We haven’t had a reason to say, “Oh, let’s ditch Storm and use something else.”
If it’s working, why mess with it?
It works fine so far.
Storm is still, as far as I know, the most popular streaming processor.
Yup. Although, you know, Storm is pretty hard to use.
That’s the other thing I’ve heard about it from everyone.
It’s complex. You have to know what you’re doing. It’s not a short learning curve. There’s a lot of AHA! moments and gotchas. It’s not as easy as it should be.
I wonder if that’s one of the things Flink is trying to solve.
It could be.
That and they’re trying to do batch and streaming together. It doesn’t make sense that you need two completely different things to do that.
Yeah. Storm is very low level coding, too. You’re not using a high-level language. It’s like Java.
Thanks for the info.
Yeah, you got it. Anytime.
Metron is new tech to me. And that’s a big part of my job these days, learning about new tech.
That sounds like an awesome job. [Laughter] Nice.
I can’t complain.
So, straight from the architect’s mouth, you now know the basics of Apache Metron, how it works to keep your network secure, and why it uses Storm.
As far as the other streaming processing engines, I understand Flink has a higher community involvement overseas. It’s just starting to gain traction in the US, but there’s more momentum on the other side of the pond. I also got an intro to Apache Apex, a Flink challenger, at Hadoop Summit. Guess where I heard about Apex? In a Capital One presentation. I’m not sure what they’re feeding their developers over there, but they seem to be going like the Energizer Bunny. (They did an interesting co-presentation with Yahoo on performance comparisons between Flink, Storm and Spark Streaming, too.)
The concept of a single execution engine for both streaming and batch loads is definitely one with legs. Apex and Flink are the pioneers, but there’s no telling who will end up taking Spark’s title as the next processor you simply have to use, or whether Spark will evolve into something even more capable than it is now and become its own successor.
In the meantime, Storm is still the streaming data workhorse. There was a birds of a feather streaming chat session towards the end of the conference, and I heard glowing admiration for Storm’s capabilities from the Apache NiFi team. I have never heard anyone say anything about it being easy to use, but if you’re willing to put in the extra development hours, and you have the expertise on your team, Storm’s a solid bet.
I also caught up with another friend at Hadoop Summit from that same Hortonworks project I worked on last year, Yolanda Davis. She was in charge of the user interface development team on that project. Now, she’s kicking butt on the Apache NiFi dev team. I ran into her at the Women in Big Data lunch. So, my next post will be a chat with her that explores some interesting facts about Apache NiFi and some good references for encouraging young girls to get into coding.
See you then.