
Expert Interview (Part 2): Jeff Bean on Apache Flink and its Available Learning Resources

At the recent Strata Data conference in NYC, Paige Roberts of Syncsort sat down for a conversation with Jeff Bean, the Technical Evangelist at data Artisans.

In the first part of this three-part blog series, Roberts and Bean discussed data Artisans, Apache Flink, and how Flink handles state in stream processing. Part two focuses on the adoption of Flink, why people tend to choose Flink, and available training and learning resources.

Roberts: The thing about Flink is that it’s really well-known in Europe and virtually unknown in the US. Why do you think that is?

Bean: I think it’s just presence. I think it’s because it’s a project mostly out of academia, mostly out of Berlin. I think, had the same project come out of Berkeley or Stanford, it would have taken off like crazy in the US. The best use cases of Flink are folks who have tried something else first and it failed for them, so then they looked around and found Flink.

What pitfalls do people run into with other stream processors that lead them to go to Flink?

That’s a really interesting question. We’re actually still doing research on that, and it’s one of the questions I hear the folks in Berlin ask themselves. The Flink community is growing and we’re building theories on why.

One of the theories is that Flink is inherently more scalable and more fault tolerant, so it can handle more complicated use cases better. I think this is true, but I don’t think that’s it. I also think that Flink is better at the basic essential stuff. There are a lot of people who believe that, for example, Spark Streaming or Kafka Streams is where they will start because they’re using Spark or Kafka already. Then they’ll discover something silly, like the fact that when you deploy a job with Spark Streaming and the client that deployed the job fails, the whole job fails. Whereas, with Flink, you submit a job from the client, the client dies, and the job keeps going.

You need to study the competitors more before you can understand what it is that you do better.

Yeah. This is what I’ve been running into. It seems like my colleagues at data Artisans are assuming that the other vendors and other projects do things that Flink does, when the fact is anything but. I’m not sure we know what our greatest strengths are, so this is one of the things I’m working on now.


So as the only Flink Evangelist in the US, what do you see as your primary job? What are you going to go forth and do?

I’m currently teaching classes on Flink.

That’s a good start.

I’ve got a docket of private trainings already signed up for the next couple of months, and then we’re going to be starting public training in the US where folks can sign up and take Flink classes. That’s number one on my list. Number two is we’re going to start blogging about the strengths of Flink, getting the word out. We have a yearly conference called Flink Forward, which we hold in San Francisco, Berlin, and China once a year in each location. The Berlin conference just passed. The San Francisco conference is in April.

Let’s say I don’t live in San Francisco. How would I go about learning Flink? Are there online resources or anything like that?

Yeah, as far as Apache projects go, the Flink project is actually pretty well documented for getting started. If you go to flink.apache.org, you can read a lot about it there. Data Artisans also offers its training resources, particularly the exercises for our training courses, to the public at training.data-artisans.com. That site has a set of exercises that you can run through for learning Flink. There’s no need for any special virtual machines or Docker images or anything like that. You just need IntelliJ. You run Maven, get the build, and start tinkering around.
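As a concrete illustration of the Maven-based setup Bean describes, a new Flink project can be bootstrapped from Apache Flink’s official quickstart archetype and then imported into IntelliJ. (A minimal sketch; the version number and project names here are illustrative, not prescribed by the interview.)

```shell
# Generate a skeleton Flink project from the official quickstart archetype.
# The archetype version shown is illustrative; use a current Flink release.
mvn archetype:generate \
  -DarchetypeGroupId=org.apache.flink \
  -DarchetypeArtifactId=flink-quickstart-java \
  -DarchetypeVersion=1.6.1 \
  -DgroupId=com.example \
  -DartifactId=flink-playground \
  -Dversion=0.1 \
  -DinteractiveMode=false

# Build the project, then open it in IntelliJ and start tinkering.
cd flink-playground && mvn clean package
```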

Does it have test data or exercise data that you can use?

Yeah, they use the New York Taxi Data that was released as a public dataset a few years back. The exercises find rides that originate within New York, rides that have a start but no matching end, and other typical stream processing tasks. Then the training materials introduce complexity to simulate the real-world conditions that Flink is good at, such as handling out-of-order data. The training materials for the Taxi Data will actually shuffle that file a little bit and give it to you out of order, just to showcase how Flink can handle event time and things like that.
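The out-of-order, event-time behavior Bean describes can be sketched in plain Python. This is a conceptual illustration of the watermark idea, not Flink’s actual API; the window size and allowed lateness are arbitrary assumptions.

```python
# Conceptual sketch of event-time windowing with a watermark (not Flink API).
# Events arrive out of order; a watermark (max event time seen, minus an
# allowed lateness) decides when a window's contents are considered final.

ALLOWED_LATENESS = 2  # assumption: events may arrive up to 2 time units late
WINDOW_SIZE = 5       # assumption: tumbling windows of 5 time units

def window_of(ts):
    """Map an event timestamp to the start of its tumbling window."""
    return ts - (ts % WINDOW_SIZE)

def process(events):
    """Group (timestamp, ride_id) events into event-time windows, emitting a
    window once the watermark passes its end; flush leftovers at end of stream."""
    windows, emitted, watermark = {}, [], float("-inf")
    for ts, ride in events:
        watermark = max(watermark, ts - ALLOWED_LATENESS)
        windows.setdefault(window_of(ts), []).append(ride)
        # Emit every window whose end the watermark has passed.
        for start in sorted(w for w in windows if w + WINDOW_SIZE <= watermark):
            emitted.append((start, sorted(windows.pop(start))))
    for start in sorted(windows):  # end of stream: flush remaining windows
        emitted.append((start, sorted(windows.pop(start))))
    return emitted

# Out-of-order stream: the event at t=4 arrives after the one at t=6,
# yet it still lands in the correct event-time window.
stream = [(1, "a"), (6, "b"), (4, "c"), (7, "d"), (12, "e")]
print(process(stream))  # → [(0, ['a', 'c']), (5, ['b', 'd']), (10, ['e'])]
```

The point of the sketch is the same one the training exercises make: results are grouped by when events happened, not by when they showed up.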

Very cool!

In the final part of the interview, Roberts and Bean speak about Flink’s unique take on streaming and batch processing, and how Flink compares to other stream processing frameworks.

Make sure to download our eBook, “The New Rules for Your Data Landscape”, and take a look at the rules that are transforming the relationship between business and IT.
