
Expert Interview (Part 1): Jeff Bean on Apache Flink and How it Handles State in Stream Processing

At the recent Strata Data conference in NYC, Paige Roberts of Syncsort sat down for a conversation with Jeff Bean, Technical Evangelist at data Artisans. In this first installment of a three-part blog series, Roberts and Bean discuss data Artisans, Apache Flink, and how Flink handles state in stream processing.

Roberts: So can you tell our readers a little bit about yourself?

Bean: Sure. My name is Jeff Bean. I’m a Technical Evangelist at data Artisans. I’m the only US Technical Evangelist and I’m responsible for teaching and training about Apache Flink and spreading the word about Flink to the broader community here in the US.

Flink has been around for a while, and it’s a pretty cool execution framework, but there isn’t a lot of information about it here in the US. Where are you based out of?

The company is headquartered in Berlin. I’m based out of the Bay Area.

Okay, so you’re here on the west coast of the US, how did you get involved with Flink?

I was at Cloudera for almost eight years. I ran the ISV partner product certification program at Cloudera and I was on the education team among other roles.

Which is how you know Syncsort [laughs].

Which is how I know Syncsort! Right. I handled Syncsort’s certification on CDH 5.x. When I left Cloudera, I talked to probably half the vendors at this conference about my next thing. data Artisans was compelling because they were small and because Flink was clearly a very promising project from a commercial standpoint. It had a lot of adoption, but not a lot of commercial market awareness. data Artisans had a real need for someone to spread the word in the US, and it was just kind of a natural fit. There’s a real hole across the market in real-time stream analytics on distributed platforms that Flink meets very well.

I’m also really interested in the way certain open source projects take off for what I presume must be social reasons, like Spark Streaming largely took off because it’s out of Berkeley and it’s co-located with Silicon Valley, rather than more practical, technical reasons.


I certainly think that part of the reason Spark has really taken off is the community support. There’s a lot of contributors, and really solid APIs. That makes a big difference in adoption rates. How is Flink on the API front?

Flink’s API is extremely expressive, and it handles problems that you don’t solve in other ways. Flink is stateful, for one. You have in your API the ability to take events, put them in state or put other information in state, and then perform more complex business logic. When I was teaching MapReduce, Hive, or data analytics at Cloudera, I would always say that Big Data systems are “distributed shared nothing and stateless.” The more state you try to manage in your application, the more likely it is that you’re violating some best practice, and the more likely it is that you’re not going to be able to scale.

Flink provides an API that can describe state, and describe time, distinguishing between perspectives such as event time and processing time. It describes events flowing through the system in a way that allows developers to reason about them and decide how to handle things like out-of-order data, and how to perform complicated aggregates using state. I think it’s very technically compelling, and also very interesting from a business standpoint.
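To make the event-time idea concrete, here is a minimal toy sketch in plain Java. It is emphatically not the real Flink API; all names (`Event`, `advanceWatermark`, etc.) are invented for illustration. It shows the two concepts Bean describes: per-key state accumulated across events, and a watermark that lets a window close correctly even when events arrive out of order.

```java
import java.util.*;

// Toy illustration of event-time windowing with watermarks -- NOT Flink's API.
// Events carry their own timestamps; a watermark asserts "no event with a
// timestamp below T will arrive anymore", so windows ending at or before T
// can be finalized even if events arrived out of order.
public class EventTimeSketch {
    record Event(String key, long timestamp, long value) {}

    static final long WINDOW_MS = 10_000;

    // Per-(key, windowStart) running sums: the kind of managed state a
    // stateful stream processor keeps on the application's behalf.
    static final Map<String, Map<Long, Long>> state = new HashMap<>();
    static final List<String> results = new ArrayList<>();
    static long watermark = Long.MIN_VALUE;

    static void process(Event e) {
        long windowStart = e.timestamp() - (e.timestamp() % WINDOW_MS);
        state.computeIfAbsent(e.key(), k -> new TreeMap<>())
             .merge(windowStart, e.value(), Long::sum);
    }

    // Advancing the watermark closes every window that ends at or before it.
    static void advanceWatermark(long wm) {
        watermark = wm;
        for (var keyEntry : state.entrySet()) {
            var windows = keyEntry.getValue().entrySet().iterator();
            while (windows.hasNext()) {
                var w = windows.next();
                if (w.getKey() + WINDOW_MS <= watermark) {
                    results.add(keyEntry.getKey() + "@" + w.getKey() + "=" + w.getValue());
                    windows.remove();
                }
            }
        }
    }

    public static void main(String[] args) {
        process(new Event("sensor-1", 1_000, 5));
        process(new Event("sensor-1", 12_000, 7));  // belongs to the next window
        process(new Event("sensor-1", 3_000, 2));   // out of order, still lands in window 0
        advanceWatermark(10_000);                   // window [0, 10000) is now complete
        System.out.println(results);                // prints [sensor-1@0=7]
    }
}
```

A processing-time system would have to aggregate events in arrival order; here the out-of-order event at timestamp 3,000 is still counted in the correct window because the window is keyed by event time and only finalized once the watermark passes.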

What language would I use if I were to write a process in Flink?

It’s Java and Scala today.

Just like Spark?

Just like Spark. We hear a lot of requests for Python support. We support the Apache Beam interface, so you can get your Python support through Beam.

So Flink is a Beam runner?

Yeah. I was concerned at first, because the first thing you do when you have an issue is remove abstraction layers, so Beam was questionable to me for that reason, but people are doing it.

You almost want to build your thing, test, refine, make sure it works well, and then put it in Beam.

Right. So for now, it’s Java and Scala, preferably Java.

Okay. And eventually Python, maybe?

We’ll see.

Check back for part 2 of the interview when Roberts and Bean speak about the adoption of Flink, why people tend to choose Flink, and available training and learning resources.

Also, make sure to download our eBook, “The New Rules for Your Data Landscape”, and take a look at the rules that are transforming the relationship between business and IT.
