At the recent Data Day in Texas, Paige Roberts of Syncsort caught up with Joey Echeverria, an architect at Splunk and author of the O'Reilly book Hadoop Security. In part one of this blog series, Roberts and Echeverria discussed some common Hadoop security methods. Part two looked in more detail at different methods of fine-grained security in Hadoop.
In this final installment, Echeverria describes the latest developments at Splunk, as well as the differences between Apache Spark and Apache Flink.
Roberts: So, switching subjects here, you’re working at Splunk, working a lot with Flink. Can you talk a little bit about what you’re working on there?
Echeverria: Sure. If you're not familiar with Splunk, it's basically a log and metrics monitoring platform. It allows you to collect a lot of log data or event-oriented data, put it into a very flexible and powerful search engine, and then slice and dice it as you need to. It enables you to do things like determine the root cause of IT failures and track security-related information.
What we're building now, though, is a platform that leverages a stream processing engine so you can do that kind of processing at ingest time. You can take large objects like that, break them apart into individual events, and pass those over to Splunk for indexing. We also want to make it easier to deal with metric data and extract metrics out of incoming logs, or even pre-aggregate metrics so that summarized indexes are stored on the other end. That's what I'm working on today.
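To make the ingest-time idea concrete, here is a minimal Python sketch of the two steps Echeverria describes: breaking a large log object into individual events, and pre-aggregating a metric so the index on the other end stores a summary rather than raw lines. The function names and the "log level is the first token" convention are assumptions for illustration, not Splunk's actual pipeline.

```python
from collections import defaultdict

def split_events(blob):
    """Break a multi-line log blob into individual event strings."""
    return [line for line in blob.strip().splitlines() if line]

def pre_aggregate(events):
    """Count events per log level so the index stores a summary, not raw lines."""
    counts = defaultdict(int)
    for event in events:
        level = event.split()[0]  # assumption: the level is the first token
        counts[level] += 1
    return dict(counts)

blob = "ERROR disk full\nINFO job started\nERROR timeout\nINFO job done\n"
events = split_events(blob)     # four individual events for indexing
summary = pre_aggregate(events) # {'ERROR': 2, 'INFO': 2}
```

The point of doing this at ingest time is that the expensive parsing and summarizing happens once, in the stream, instead of at every query.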
You were talking about using Kinesis and using Flink. Why did you choose Flink in particular? And what are you doing with it?
Apache Flink is a data processing engine, and in some ways it's the polar opposite of Apache Spark. What I mean by that is the builders of Apache Spark took concepts from MapReduce and Microsoft's Dryad and came up with a way of building a modern high-speed batch query engine. They later decided to add the ability to look at data in what they call micro-batches, and basically apply that batch query engine to stream processing use cases. Flink built a modern stream processing engine, and then later decided, "Oh hey, we can actually wrap this in a batch API and also do batch processing with it."
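The micro-batch versus native-streaming distinction can be sketched in a few lines. This is a conceptual toy, not either engine's API: a micro-batch engine buffers events and runs a batch computation per buffer, while a native streaming engine updates its result as each event arrives.

```python
def micro_batch_sums(stream, batch_size):
    """Micro-batch style: buffer events into small batches, run a batch job on each."""
    batches = [stream[i:i + batch_size] for i in range(0, len(stream), batch_size)]
    return [sum(batch) for batch in batches]

def per_event_sums(stream):
    """Native streaming style: update a running result as each event arrives."""
    total, outputs = 0, []
    for value in stream:
        total += value
        outputs.append(total)
    return outputs

stream = [1, 2, 3, 4]
micro_batch_sums(stream, 2)  # one result per batch: [3, 7]
per_event_sums(stream)       # one result per event: [1, 3, 6, 10]
```

The practical consequence is latency: the micro-batch result for an event can't be emitted until its batch closes, while the per-event result is available immediately.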
I mentioned the ability to do aggregation. If you're doing an aggregation in a stream processing system, that means the stream processing has to be stateful, and to do stateful stream processing you have to periodically checkpoint state to stable storage, so that after a failure you can resume processing where you left off. Now, both Spark and Flink support that kind of checkpointing, but how they implement it is very different. In particular, Spark checkpoints the entire graph. Whatever you set up as the process that you want to run, it's checkpointed alongside the data.
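As a minimal sketch of what "stateful processing with periodic checkpoints" means, here is a toy counting operator in Python that checkpoints its state to a stand-in for stable storage and can be restored after a failure. The class and method names are hypothetical; real engines checkpoint to durable storage such as HDFS or S3 and coordinate checkpoints across many operators.

```python
import copy

class StatefulCounter:
    """Toy stateful operator: counts events per key and supports checkpointing."""

    def __init__(self):
        self.state = {}

    def process(self, key):
        self.state[key] = self.state.get(key, 0) + 1

    def checkpoint(self, storage):
        """Write a consistent snapshot of the state to 'stable storage'."""
        storage["counter"] = copy.deepcopy(self.state)

    @classmethod
    def restore(cls, storage):
        """Rebuild the operator from the last snapshot after a failure."""
        op = cls()
        op.state = copy.deepcopy(storage.get("counter", {}))
        return op

storage = {}                  # stands in for durable storage (e.g. HDFS, S3)
op = StatefulCounter()
for key in ["a", "b", "a"]:
    op.process(key)
op.checkpoint(storage)        # periodic checkpoint

op = StatefulCounter.restore(storage)  # simulate a crash and recovery
op.process("a")
# op.state == {"a": 3, "b": 1}: counting resumed where the checkpoint left off
```

Without the checkpoint, a failure would reset the counts to zero; with it, only the events processed after the last snapshot need to be replayed.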
Flink does it differently: it maintains state per operator and lets users give operators unique global names. That lets you preserve state for every operator that hasn't been modified, even when the pipeline around it changes. So by using Flink, we get the ability to change pipelines, compile them together with all the other pipelines our users have configured, and get those optimizations, without sacrificing state.
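The name-matching idea can be illustrated with a small sketch. In Flink's DataStream API this corresponds to assigning each operator a stable identifier with `.uid("...")`; the Python below is only a conceptual model of how checkpointed state is matched back to operators by name when a modified pipeline restarts, with hypothetical uids and state contents.

```python
def restore_states(checkpoint, pipeline_uids):
    """Match checkpointed state to operators by their unique names (uids).

    Operators whose uid survives a pipeline change keep their state;
    new operators start empty; state for removed operators is dropped.
    """
    return {uid: checkpoint.get(uid, {}) for uid in pipeline_uids}

# Checkpoint taken from version 1 of a pipeline.
checkpoint = {
    "parse-logs":   {"lines_seen": 100},
    "count-errors": {"ERROR": 7},
}

# Version 2 drops "count-errors" and adds "extract-metrics".
v2 = restore_states(checkpoint, ["parse-logs", "extract-metrics"])
# "parse-logs" keeps its state; "extract-metrics" starts empty
```

Because state is keyed by operator name rather than by the whole job graph, editing one part of a pipeline doesn't invalidate the state of everything else, which is exactly the property Echeverria is describing.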
I guess the last time I talked to somebody who was doing that kind of streaming process was my friend Ryan who was working on Metron. He decided to go with Storm. What’s the reason why you went with Flink instead of Storm?
That's a good question. Storm is one of the oldest open source stream processing frameworks, and it has a lot of very desirable features. In our evaluation, though, we found that Flink had on average higher throughput and lower latency, and that was very attractive to us. To put it another way, I think Flink learned a lot of lessons from systems like Storm and was able to build those lessons in from day one, whereas Storm is constantly evolving, trying to learn and improve as it goes.
Some of Syncsort’s applications, like DMX-h and Ironstream®, are pushing either directly to Splunk, or to Kafka to integrate with stream processing systems. How do you usually get your data?
Splunk has a number of ingest technologies called forwarders, and the most common way Splunk gets data today is that those forwarders are deployed on or near the systems being monitored and forward the data to your Splunk indexing cluster. That works fine when all you need to do is pass data along to be indexed. If you want to run these sorts of stream processing queries or jobs against the data before it's indexed, you need an easy way to bring all of that data together, and we're leveraging Kafka and Kinesis for that. We start with a common message bus; whether we use Kafka or Kinesis depends largely on where the system is deployed.
All right. Well, thank you for taking the time today. This was fun.
Yeah, of course. Glad to help.
Make sure to check out our eBook, 6 Key Questions About Getting More from Your Mainframe with Big Data Technologies, and learn what you need to ask yourself to get around the challenges, and reap the promised benefits of your next Big Data project.