
Expert Interview (Part 1): Hortonworks’ Yolanda Davis on Ten Years of Hadoop and Apache NiFi

Hadoop Summit in San Jose this year celebrated Hadoop’s 10th birthday. All of the folks on stage were people who contributed to Hadoop during those 10 years. One of them was Yolanda Davis, Senior Software Engineer at Hortonworks.

Celebrating 10 Years of Hadoop at Hadoop Summit

Yolanda and I worked together on a Hortonworks project last year, where she was in charge of the user interface design and development team. I caught up with her early in the morning of the last day of Hadoop Summit and quizzed her on the new project she’s working on, which you may have heard of: Apache NiFi. As promised, here is my interview with her on the subject of NiFi and the new HDF (Hortonworks DataFlow) streaming data processing platform, which includes NiFi, Apache Kafka and Apache Storm.

Just to give you an idea of what she’s been doing lately, I’ll start off with a few tweets from the end of that day, when there was a Birds of a Feather session for all things NiFi and streaming.

Let’s start with an introduction. What’s your new position?

I’m a Senior Software Engineer. I work at Hortonworks, specifically in HDF engineering, so the Hortonworks DataFlow products and framework, which is powered by Apache NiFi. The goal of that framework is to help deal with the whole data ingress/egress problem. A lot of people are just trying to get their data in, trying to get high quality data, so they can go ahead and process it. How can they get there quickly? That is what HDF is helping to resolve: how can we get this data from the edge, then process and transform it into a form that we can start doing some real analytics on? That is what the framework is designed for, and I’m part of that team.

HDF is all about streaming. So you guys are working a lot with data in motion, right?

Exactly. It’s all about dealing with the problem of data in motion, at play and at rest. We have NiFi, which helps get you data from multiple sources, whatever type they are. If you have, say, a REST service you’re trying to pull in, absolutely you can do that. Your typical, regular data stores or a regular database, you can pull them in that way. Or, we have the newer project in the community called MiNiFi that’s helping us extend the edge even further. So now, whether it’s a particular device that you need to track, a sensor or some other device out in the field, MiNiFi will help resolve that problem. It has an agent that’s out there getting the data from that device and transporting it back to NiFi to do the rest of the processing. So that’s your data in motion. And the whole deal is that not only do you want to capture that data, but you also want to see what happens to that data along the way, which is another problem.
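A quick aside from me: in production, MiNiFi agents typically ship data back over NiFi’s site-to-site protocol, but to get a feel for the edge-to-NiFi hand-off Yolanda is describing, here’s a minimal Python sketch of a field device posting one sensor reading to a NiFi ListenHTTP endpoint. The host, port, path, and sensor names are placeholder assumptions of mine, not details from her setup.

```python
import json
import time
import urllib.request

# Placeholder endpoint: assumes a NiFi ListenHTTP processor is
# configured on this host/port with the default base path.
NIFI_URL = "http://nifi-host:8081/contentListener"

def send_reading(sensor_id, value):
    """POST one JSON sensor reading to NiFi; each POST becomes a flowfile."""
    payload = json.dumps(
        {"sensor": sensor_id, "value": value, "ts": time.time()}
    ).encode("utf-8")
    req = urllib.request.Request(
        NIFI_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status  # 200 means NiFi accepted the data

if __name__ == "__main__":
    print(send_reading("field-sensor-42", 21.7))
```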

Tracking and lineage.

Yeah, tracking the change. What happened when you received it? What did it look like? What are the attributes of that data? What was processed along the way? How do we track it? That’s the provenance data.

All along the way, as data moves through NiFi, there is a record that’s kept, so you know what happened at each place and time. It’s all based on flow-based programming. Part of it is that each step takes the data that comes in and transforms it; it doesn’t care what happened before or after. It spits the data back out with a record left behind saying, “This is what occurred.”

It’s especially great for those governance challenges where people want to see the history of the data, what happened to it, or even replay what happened in the flow. That is a critical piece of putting together the picture of your whole data journey.
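For readers who want to poke at this themselves: the provenance records Yolanda describes are exposed through NiFi’s REST API, so you can script queries against them. Below is a minimal sketch, assuming a NiFi 1.x instance at a placeholder address; the submit/poll/delete pattern and the field names follow the 1.x API as I understand it, so verify them against your version.

```python
import time
import requests  # third-party: pip install requests

BASE = "http://nifi-host:8080/nifi-api"  # placeholder address

# Submit an asynchronous provenance query for recent events.
query = {"provenance": {"request": {"maxResults": 25}}}
prov = requests.post(f"{BASE}/provenance", json=query).json()["provenance"]

# Poll until NiFi has finished assembling the results.
while not prov["finished"]:
    time.sleep(0.5)
    prov = requests.get(f"{BASE}/provenance/{prov['id']}").json()["provenance"]

# Each event is one "what happened at this place and time" record.
for event in prov["results"]["provenanceEvents"]:
    print(event["eventTime"], event["eventType"], event["componentName"])

# Provenance queries are server-side resources, so clean up afterward.
requests.delete(f"{BASE}/provenance/{prov['id']}")
```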

Yolanda Davis

How does it share that tracking and lineage data? Does it integrate with Apache Atlas?

That’s going to be the end game. Atlas will use that provenance data. The whole goal is for that provenance information to be fed into Atlas. Then that can help create a larger picture for you.

I’m working with the Atlas product manager. We’re talking with the PMs from Atlas and from NiFi. Basically what we’re bringing into the game is mainframe data.

Ah, okay. So in terms of connecting through some messaging service or —

We do Kafka.

Yeah, well that’s part of that stack, right? A critical part.

Yeah. It seems like Kafka is becoming almost like a backbone of the stack.

Yeah! It is a backbone. It’s a commit log. It can run with HDFS or without it, which, and this is just my opinion, I think makes it attractive to a lot of people who might not be ready for the whole HDP [Hortonworks Data Platform] yet. They might not have clusters in play. But the thing with HDF is, it’s a great introduction, right?
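Since Kafka keeps coming up as the backbone, a tiny sketch of the commit-log idea may help: producers append records to a topic’s log, and any number of consumers replay that log at their own pace. The broker address, topic name, and the choice of the kafka-python client library are my own assumptions for illustration.

```python
from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

BROKER = "kafka-host:9092"   # placeholder broker address
TOPIC = "sensor-readings"    # placeholder topic name

# A producer appends records to the end of the topic's commit log...
producer = KafkaProducer(bootstrap_servers=BROKER)
producer.send(TOPIC, b'{"sensor": "field-sensor-42", "value": 21.7}')
producer.flush()

# ...and consumers replay the log independently, at their own pace,
# which is what lets Kafka act as a backbone between systems.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    auto_offset_reset="earliest",  # start from the beginning of the log
    consumer_timeout_ms=5000,      # stop iterating once caught up
)
for record in consumer:
    print(record.offset, record.value)
```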

If you want to get there and get there quickly. So, as you know, in a previous life, what was I doing to get data?

Yeah. And that was not quick. [laughter] 

It was not quick. The other challenge, too, was that you had to have developers. You had to have a deployment model. And then, if you had to make a change, you had to have another developer come in and make that change.

The whole interactive command and control in HDF helps eliminate that need. You can create your flow and your environment within the UI and put it out there without having to deploy any code. That not only eliminates the need for a developer to make the change happen, but also lets different levels of people interact with the system, whether that’s testing things out in a small test environment or your operational folks. It makes it more accessible.

Then you get to see, live and in real time, what’s going on with your data, and you can make changes in real time if you need to. That’s what makes it awesome, I think: making it more accessible. You see it even here [at Hadoop Summit] in the presentations: not only were people able to get their jobs done quickly, but NiFi also has a lot of the controls you need for guaranteed delivery, solving the problems we talked about before, things that people look for in their production environment. The framework comes with them.
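One more aside: the interactive command and control Yolanda mentions is what the NiFi UI gives you, and the UI itself drives a REST API, so the same start/stop operations can be scripted. Here’s a minimal sketch of toggling a processor’s run state, assuming a NiFi 1.x instance at a placeholder address and a placeholder processor id; check the endpoint shape against your version.

```python
import requests  # pip install requests

BASE = "http://nifi-host:8080/nifi-api"     # placeholder address
PROCESSOR_ID = "your-processor-uuid-here"   # placeholder id

def set_run_state(processor_id, state):
    """Start or stop a processor; state is 'RUNNING' or 'STOPPED'."""
    # NiFi uses optimistic locking, so fetch the current revision first.
    current = requests.get(f"{BASE}/processors/{processor_id}").json()
    body = {"revision": current["revision"], "state": state}
    resp = requests.put(
        f"{BASE}/processors/{processor_id}/run-status", json=body
    )
    resp.raise_for_status()

set_run_state(PROCESSOR_ID, "STOPPED")  # pause the flow at this step
set_run_state(PROCESSOR_ID, "RUNNING")  # and resume it, live
```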

So it expands your user base, shortens your time to value, and gives you a lot of enterprise features without having to sit down and write code.

Exactly, exactly.

Well, that’s pretty awesome.

In part 2 of our conversation with Yolanda, we’ll discuss the barriers women face in technical fields, and some of the good organizations that encourage young women to get into the field.


Authored by Paige Roberts

Product Manager, Big Data
