
Expert Interview (Part 3): Livy and Spot are Apache Spark and Cyber Security Projects, Not the Names of Sean Anderson’s Dogs

Shortly after Strata in New York last year, Syncsort Big Data Product Manager Paige Roberts caught up with Sean Anderson, who is in charge of product marketing for data science and engineering at Cloudera. In earlier parts of this interview series, he provided a lot of information on new features and improvements in Spark 2.0, including Spark on the Cloud and Spark Structured Streaming. In this third and final portion of the interview, Sean Anderson dug into two new projects, Apache Livy and Apache Spot.

Apache Livy

Sean Anderson: Recently, we helped build and launch an open source project called Apache Livy.

Livy is an open-source REST service for Apache Spark jobs and has some great features along the lines of remote snippet and job execution. So I can take a very specific snippet of code, and use Apache Livy as a web interface to get that into my Spark context.

Paige Roberts: Wow. I didn’t know about that one at all. That’s brand new to me. So, is it sort of Oozie-like or … Can you give me some more detail?

Anderson: It’s an open source REST service for Apache Spark. It’s in Cloudera Labs right now. It’s up for inclusion, and we think it will be included in the Apache projects pretty soon.

Apache Livy is a REST service. It’s specifically valuable for long-running Spark contexts where multiple production jobs are present. It gives you the ability to manage the multiple contexts simultaneously. You can run them on the cluster via YARN using the Livy service if you want to get better fault tolerance. You can submit jobs, and there’s some better security integration for that as well.
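To make the description above concrete, here is a minimal sketch of how a client might talk to Livy over REST. It assumes a Livy server at `http://localhost:8998` (8998 is Livy's default port; the host is an assumption) and uses the `/sessions`, `/sessions/{id}/statements`, and `/batches` endpoints. The helper names are hypothetical, and the code only builds the requests rather than sending them, so it runs without a cluster.

```python
import json
from urllib.request import Request

# Assumption: a Livy server is reachable at this address.
LIVY_URL = "http://localhost:8998"


def livy_request(path, payload):
    """Build (but do not send) a JSON POST request for a Livy endpoint."""
    return Request(
        url=LIVY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_session(kind="pyspark"):
    """POST /sessions asks Livy to start a long-running Spark context."""
    return livy_request("/sessions", {"kind": kind})


def run_snippet(session_id, code):
    """POST /sessions/{id}/statements runs a code snippet in that context."""
    return livy_request(f"/sessions/{session_id}/statements", {"code": code})


def submit_batch(file, class_name=None):
    """POST /batches submits a packaged job for Livy to launch."""
    payload = {"file": file}
    if class_name:
        payload["className"] = class_name
    return livy_request("/batches", payload)


if __name__ == "__main__":
    # Hypothetical session id 0 and an illustrative snippet.
    req = run_snippet(0, "sc.parallelize(range(100)).sum()")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To send it for real: urllib.request.urlopen(req) against a live server.
```

Because each session has its own id, a client can keep several contexts open at once and address snippets to whichever one it needs, which is the multi-context management Anderson describes.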


Roberts: How do you see that getting used?

Anderson: For us, at Cloudera, it’s really all about: How do we make sure that we can iterate and develop on existing Spark workloads without having to take them out of production? Livy allows us to do that in a pretty nice, elegant way.

Roberts: I’ll have to do more research on that one. I was just talking to Doug Cutting at the Cloudera customer award ceremony, and he told me about Spot, which I had not heard of before. It seems like all I’ve got to do is talk to you guys to get great name ideas for my dogs, and also learn about all the new cool tech! [laughs]

Anderson: [laughs] Yes! Apache Spot was previously called ONI, or Open Network Insight. It’s a pretty cool project. That’s something we launched in collaboration with Intel that is now incubating.

Apache Spot

Anderson: Spot aims to be a common platform for cyber security and network intrusion detection, with Hadoop as the underlying platform. Traditionally we see SEM systems that only cover a couple of network endpoints. But increasingly, people need the ability to ingest massive amounts of data, and to coalesce that with other sources. So Apache Spot is pretty nice, and it is gaining traction in record time.

Roberts: So it supports multiple endpoints, not just particular types of hardware?

Anderson: Right. In the same way that you can mix sources for an analyst’s perspective, you can do that with Apache Spot. So you may have network flows, you may have DNS, you may have proxy logs. They’re all streaming into a centralized system. There may be some machine learning or some event monitoring that’s happening on that system. Then you have the ability to operationalize those into scoring systems, stuff like that. It really helps you build these robust cyber capabilities.
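As a rough illustration of the idea of mixing sources and turning them into scores, here is a minimal sketch. It is not Apache Spot's API or its actual models; the event records, field names, and the rarity-based score are all assumptions made up for the example. It pools flow, DNS, and proxy events into one list and ranks them by how rarely each (source, target) pair appears.

```python
from collections import Counter

# Hypothetical, simplified events from the three source types mentioned above.
EVENTS = [
    {"source": "flow", "host": "10.0.0.5", "target": "10.0.0.9:443"},
    {"source": "flow", "host": "10.0.0.6", "target": "10.0.0.9:443"},
    {"source": "flow", "host": "10.0.0.7", "target": "10.0.0.9:443"},
    {"source": "dns", "host": "10.0.0.5", "target": "intranet.example.com"},
    {"source": "dns", "host": "10.0.0.6", "target": "intranet.example.com"},
    {"source": "proxy", "host": "10.0.0.7", "target": "http://rare.example.net/payload"},
]


def score_events(events):
    """Score each event by the relative frequency of its (source, target) pair.

    Returns (score, event) pairs sorted so the rarest, and therefore most
    unusual, events come first. This is a stand-in for a real model.
    """
    counts = Counter((e["source"], e["target"]) for e in events)
    total = len(events)
    scored = [(counts[(e["source"], e["target"])] / total, e) for e in events]
    # Sort on the score only, so the event dicts are never compared.
    return sorted(scored, key=lambda pair: pair[0])


if __name__ == "__main__":
    for score, event in score_events(EVENTS):
        print(f"{score:.3f}  {event['source']:5s}  {event['host']}  {event['target']}")
```

In this toy data set the single proxy request ranks first because nothing else resembles it. A production system would replace the frequency count with trained models and run over far larger volumes.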

Roberts: That sounds like something that’s going to make a lot of IT ops guys happy. So, are these new projects where you guys are putting your energy these days?

Anderson: Yeah. For Cloudera, things like streaming, machine learning and Spark in the Cloud are going to be big areas of focus. We see these as really robust capabilities that are not only evolving Spark ecosystems, but we now have production customers that demand very specific streaming performance. Or, they have high demands on the number of machine learning algorithms they can launch. So that’s just going to be a huge focus for us moving forward. We often follow the lead of our customer heroics and we see them gravitating to streaming and machine learning solutions.

Roberts: Cool. Well, thank you so much. I appreciate you talking to me.

Anderson: Thanks for including me on this.

See what other experts are saying about Spark, data lakes, and the future of Big Data. Download Syncsort’s eBook: Bringing Big Data to Life, What the Experts Say
