
Expert Interview Series – Part 1: Scott Gnau on the Hadoop Toolbox, Spark, and Future-Proofing Hadoop

Hortonworks has been a major force in shaping big data and data analytics into what they are today. In this interview, Hortonworks CTO Scott Gnau talks about future-proofing, adding tools to the Hadoop ecosystem toolbox, and the advantages of using streaming and batch together in a connected data architecture.

At Hadoop Summit, Syncsort’s Paige Roberts sat down with Hortonworks Chief Technology Officer, Scott Gnau. In this interview, Scott talks about what Hadoop is, from his perspective, how it fits into the wide world of big data analytics, and what’s coming down the pipeline that’s particularly exciting. Join us!

What do you think Hadoop is for?

Scott Gnau: Really, you’re going to ask me that?

I ask everybody that one.

First, Hadoop is not one thing; it is many things. It is an ecosystem. That being the case, it is for many things as well. The way I think about it is as a toolbox for capturing, analyzing, and moving data, with many different tools in it.

What do you think of Spark?

I’m really glad that it’s one of the tools in the toolbox. I think Spark does some really interesting things that enable some really great use cases. I think, conversely, like many shiny new objects that show up in the world, some individuals and some companies try to propose that it’s the end-all, be-all, you know, the solution to world hunger. And it’s not.

And it will cure cancer!

But it’s definitely a very important component in that toolbox for working in concert with other things like common security, common governance, as well as analytics, whether it be Spark Streaming, Spark SQL, or whether it be taking advantage of in-memory capabilities and response times for batch reporting. All those things are really interesting and cool.

At the same time, just as a carpenter can use a tool for the wrong purpose, any of these tools can be misused. I think it’s really important that, as an industry and as a community, we’re very clear with our constituents about what the tools are good for and what they are not so good for, so that we can protect our users and the consumers of this technology from outcomes that end in failure, or that are sub-optimal because they haven’t been properly educated about the boundaries and the most useful use cases.

One of the biggest things right now is streaming data technology, and handling streaming data. NiFi and HDF are the new darlings at Hortonworks. What do you think about that? How is that doing? Where is it going?

Streaming is all the rage. It’s an important component of the overall architecture. As we talked about this morning, connected data architectures are about having access to all of the data, all of the time: real-time, near real-time, batch, whatever. Streaming as a use case is really important because doing in-line analytics on data while the data are in motion enables action to be taken at the point of interaction with the customer, action to be taken before fraud has occurred, action to be taken in the stream. So, that’s really important.

Like anything else, there are a number of competing technologies, and there are some legacy technologies trying to rebrand themselves. There is, again, some debate and confusion in the marketplace. Similar to our approach in the data-at-rest space, where we built out a robust ecosystem, we think it’s really important in the streaming space to build out and support a robust ecosystem, to future-proof the architecture that’s being built.

So, there are multiple tools, and there will be multiple answers and different optimizations from each of them. Our approach is to be very supportive and inclusive of that ecosystem so that we can provide the largest breadth of solutions and use-case coverage. It also future-proofs the architecture: as new technologies come along, they can be plugged in.

The future of data is about Connected Data Platforms. With the massive volume of data generated by the Internet of Anything, businesses need a better way to get that data into the platform while identifying insights in real time. And I say Internet of Anything because it’s not just devices; it’s log files, clicks, and social media sentiment that are creating all of this new unstructured data. A Connected Data Platforms strategy handles both data in motion and data at rest, future-proofing the business for all data.
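The in-line analytics Scott describes, acting on events while they are still in motion rather than after they land at rest, can be sketched minimally in plain Python. This is a hypothetical illustration only: a real pipeline would use tools like NiFi, Kafka, or Spark Streaming rather than a hand-rolled generator, and the event fields and fraud threshold here are invented for the example.

```python
# Hypothetical sketch of in-line streaming analytics: score each event
# as it arrives and raise an alert before the event is ever written to
# storage. The schema and threshold are invented for illustration.

def flag_fraud(events, threshold=1000):
    """Yield an alert the moment a suspicious event passes by in the stream."""
    for event in events:
        if event["amount"] > threshold:
            yield {"alert": "possible fraud", "id": event["id"]}

stream = [
    {"id": 1, "amount": 50},
    {"id": 2, "amount": 5000},   # exceeds the threshold mid-stream
    {"id": 3, "amount": 75},
]

alerts = list(flag_fraud(stream))
print(alerts)  # [{'alert': 'possible fraud', 'id': 2}]
```

The point of the sketch is the shape of the computation: the analytic runs per event, in the stream, so the reaction can happen at the point of interaction instead of in a later batch pass over data at rest.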

When you say future-proofing, what are you thinking about? Flexibility of architecture over time?

Pluggable components, so that as new components come along …

You can take one out, put another one in.

You can take one out and put one in, yes, or you can keep adding to it, or rearrange. Making sure that there is choice without having to re-architect all the applications.
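The pluggable, future-proof architecture Scott describes can be sketched in miniature: applications code against a stable interface, and the engine behind it can be taken out, swapped, or added to without re-architecting the application. This is a hypothetical illustration under that assumption; the interface and class names are invented for the example, not any actual Hortonworks or Syncsort API.

```python
# Hypothetical sketch of pluggable components: the application depends
# only on a stable contract, so execution engines can be swapped
# without changing application code. All names are illustrative.

from abc import ABC, abstractmethod

class ExecutionEngine(ABC):
    """Stable contract the application codes against."""
    @abstractmethod
    def run(self, job: str) -> str: ...

class MapReduceEngine(ExecutionEngine):
    def run(self, job: str) -> str:
        return f"{job} via MapReduce"

class SparkEngine(ExecutionEngine):
    def run(self, job: str) -> str:
        return f"{job} via Spark"

def submit(job: str, engine: ExecutionEngine) -> str:
    # The application never changes; only the plugged-in engine does.
    return engine.run(job)

print(submit("nightly-etl", MapReduceEngine()))  # nightly-etl via MapReduce
print(submit("nightly-etl", SparkEngine()))      # nightly-etl via Spark
```

Because `submit` knows only the interface, moving a workload from one engine to another is a configuration choice rather than a rewrite, which is the "choice without re-architecting" Scott is pointing at.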

That’s part of the Syncsort value proposition, too.

Sure. You don’t have to recompile when you move from MapReduce to Spark. I heard that pitch today [at Hadoop Summit]. Very compelling.

So, did you have anything to do with the Syncsort partnership? Were you involved in the selection process?

Was I involved? You bet!

You drove it?

I got to fly to New Jersey in the middle of winter to make sure we got it done.

New Jersey in January, that is a sign of dedication! [laughs]

It was a real treat. I flew from Miami. [Laughs]

So, what specifically excited you about it? Why did you get on that plane in January?

Well, we worked on it for a long time, of course. But I liked a couple of things. One, I like the fact that DMX-h truly runs natively on the cluster. A lot of similar solutions are hybrid or loosely integrated with Hadoop rather than fully immersed. From a technology perspective, you always like to choose the one that’s maybe not the most complex, but certainly the most elegant and well done. So, I like that piece of it. The second is the product we’re collaborating on. Syncsort has a broad portfolio of products, some of which I’m not familiar with. But I like the idea that DMX-h is built from pluggable components. It’s not the kind of tool where you have to replace everything in the enterprise. It’s very sophisticated, but it’s also very focused on the specific use case.

When I think about best of breed components working within an ecosystem, that’s the best thing you can have. There’s very little overlap with other stuff. Fly your slot, be the best at it. Prove that use case, be useful, but also plug into other components. Syncsort sells other components to go with the DMX-h product and other companies sell other products that it can work in concert with. So, you’re not getting into a whole standards dialogue. I like that.

Certainly, I also enjoyed working with the Syncsort team and the shared attitude of helping customers solve problems.

At the recent Hadoop Summit in San Jose, Scott Gnau appeared with Syncsort’s Big Data GM, Tendü Yoğurtçu, on SiliconANGLE TV to discuss the reseller partnership. They highlighted Syncsort’s competitive differentiation in the Hadoop Data Integration space and Scott detailed why Syncsort was the ETL vendor of choice for Hortonworks Connected Data Platforms.

Tomorrow, in Part 2, Paige and Scott talk about the state of governance and metadata management in Hadoop, what’s new in cyber security and why it’s important.
