Syncsort’s Paige Roberts caught up with Steve Sarsfield from Hewlett Packard Enterprise (HPE) at the latest Strata. Steve is the product marketing manager for HPE Big Data Software, focused on their Vertica for SQL on Hadoop product. Steve is also a notable name in the arena of data quality and governance, and authored the book The Data Governance Imperative. Enjoy some keen industry insight in this interview between Paige and Steve.
So, we’re at Strata, and you’re a Vertica person. What do you feel the intersection is for Hadoop and Vertica?
HPE and Hadoop really intersect quite a bit when it comes to some of the innovations that we’re working on. We have some great innovations that we’re showing [here at Strata]. One of them is our big data reference architectures, which we’ve designed to work in partnership with Hadoop, specifically HDFS and YARN. These reference architectures allow you to use YARN labels to separate compute from storage. So if you want to make that dynamic within the organization, you can use YARN labels to specify how much compute and how much storage you want to use for any job.
The second part is that we have HPE Vertica for SQL on Hadoop. That is a product that allows you to install our Vertica engine directly into the Hadoop cluster and perform SQL queries on Hadoop. It’s 100% TPC-DS compliant, fully ANSI SQL compliant and can be installed either in the Hadoop cluster or separately as a Vertica cluster. It’s a high-performance engine, and we’re happy to show that off here at Strata, too.
Syncsort and Vertica have been pretty tight over the years.
What do you see as the synergies? What makes it such a good partnership?
Our strength is in providing very fast analytics for massive amounts of data. We focus all of our effort, from the way we store data to the way we compress columns, so that the analysis happens fast. What Syncsort brings to the table is the basic concept of getting the data into the database. That’s really important, because although we ingest data, we don’t have that completely covered. If you have complex data or particularly tricky data, we rely on our partnerships like Syncsort. I think that’s a really important component, especially in today’s age when there are so many different file formats and unstructured data and a lot of options when it comes to storing data. We need a partner like you guys to do it.
This is a question I’ve been asking everyone to get different perspectives. What do you think Hadoop is for?
Hmm, interesting question.
It’s a “make you think” question.
Hadoop is a general term that describes many projects that are going on in the open source community. Hadoop, and specifically HDFS, is primarily for storing data at a very low cost. Companies gather data without really being sure what it’s good for or what value it has. They need some low-cost place to put it. Hadoop, or at least the HDFS component of Hadoop, is a really good place for that. The whole Hadoop community is based on the fact that more and more data is coming at us. However, what we aren’t seeing is IT budgets growing by a lot. What I hear is data volumes growing by 25 to 50 percent, or more in certain companies, but IT budgets are growing by about 4 percent. So companies are looking for ways to store data at a low cost, and that’s one of the functions Hadoop does well.
The other thing is data discovery: understanding what data you have and getting into the data to see if there’s any value there. Those two components are what I think it’s for. Beyond that, it’s pretty exciting to see all the other things that the Hadoop community is incubating: countless projects that help companies manage big data.
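As a rough illustration of the gap Steve describes, the sketch below compounds the growth rates he quotes. It assumes both rates are annual and stay constant, which the interview does not state; the function name and the five-year horizon are hypothetical choices made only for the example.

```python
def growth_gap(data_growth: float, budget_growth: float, years: int) -> float:
    """Return how much the data-to-budget ratio has grown after `years`.

    Rates are annual fractions (0.25 means 25 percent). Assumption: both
    rates compound yearly and stay constant, which is a simplification.
    """
    data = (1 + data_growth) ** years
    budget = (1 + budget_growth) ** years
    return data / budget


if __name__ == "__main__":
    # 25 and 50 percent data growth against 4 percent budget growth.
    for rate in (0.25, 0.50):
        ratio = growth_gap(rate, 0.04, years=5)
        print(f"data +{rate:.0%}/yr, budget +4%/yr: {ratio:.1f}x after 5 years")
```

Under these assumptions, five years of 25 percent data growth against 4 percent budget growth leaves about 2.5 times as much data per budget dollar, and 50 percent growth leaves a little over 6 times as much.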
What do you think of Spark?
Spark is really exciting technology. It seems like something that will be really powerful in the future. I like the concept of having both operational analytics and batch analytics in one platform, and I think that’s what they’re trying to do with Spark. Is it fully mature and ready for the enterprise yet? No. But it’s very exciting that somebody’s working on that problem. It’s very exciting for the future.
What do you mean specifically when you say operational analytics?
I’ll give you an example. Very often a company will want to look at both the stream of data that’s coming into the company and also look at three years of history. A great example of that is security analytics. I want to make sure there are no security anomalies in my web apps …
Cyber security, right, but I also maybe want to look at, well, if something happens, did it ever happen before over the last three years? On the operational side, where you’re looking at the stream, it’s sort of a heads-up display. You also have the long-term analytics that you can dig into. Supporting both shorter-running queries and longer-running queries is probably the primary goal of the Spark platform and, really, its biggest benefit. That’s pretty exciting stuff.
It’s also interesting for machine learning and analytics that are often described as predictive analytics. If you want to run some k-means clustering or regression analysis, Spark and Vertica can do it for you.
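The security example above, watching the live stream while asking whether the same thing ever happened before, can be sketched in a few lines. This is a minimal in-memory illustration, not Spark or Vertica code: the `Event` fields and both function names are hypothetical, and a plain list stands in for both the stream and the three-year archive.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple, Tuple


class Event(NamedTuple):
    # Hypothetical fields chosen for the example.
    source_ip: str
    kind: str  # e.g. "failed_login"


def build_history_index(history: Iterable[Event]) -> Counter:
    """Count how often each (source_ip, kind) pair appears in the archive."""
    return Counter((e.source_ip, e.kind) for e in history)


def annotate_stream(
    stream: Iterable[Event], index: Counter
) -> Iterator[Tuple[Event, int]]:
    """Yield each live event with the number of times it was seen before."""
    for event in stream:
        # A Counter returns 0 for keys it has never seen.
        yield event, index[(event.source_ip, event.kind)]


if __name__ == "__main__":
    history = [
        Event("10.0.0.5", "failed_login"),
        Event("10.0.0.5", "failed_login"),
        Event("10.0.0.9", "port_scan"),
    ]
    live = [Event("10.0.0.5", "failed_login"), Event("10.0.0.7", "port_scan")]
    for event, seen_before in annotate_stream(live, build_history_index(history)):
        label = "seen before" if seen_before else "new"
        print(event, seen_before, label)
```

In a real deployment the history lookup would be a query against the long-term store and the stream would come from a streaming engine; the sketch only shows how the short-running and long-running views combine.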
What kinds of technologies would you use to build that? What does that stack look like?
Today, what I see are a lot of companies using Hadoop (HDFS) for their data lake. It’s low cost and is a great place to park data when you’re unsure of its value. If you want to monitor the stream, you could use Spark or some in-memory database. However, to connect HDFS, operational analytics and deep lengthy analytics, many are turning to new technologies like Kafka to distribute the data. Kafka is an important part of it. Are you guys getting into that? You must be.
We see Kafka everywhere, and that seems like exciting technology as well. So, they want to do analytics with Spark, maybe even look at the operational side or the stream with Spark, then any long-term analytics they can do with Vertica, and then share the data between those two using Kafka.
One of the things Syncsort is looking at is one interface for streaming and batch, pretty much what you’re talking about as far as getting the data into, say, Vertica. Flowing it in at speed here, and then batching it in over there, and being able to design all of that with one interface. So it seems like Syncsort is thinking about exactly what you are thinking about on that.
Yeah. Seems like. ESBs used to be the way we moved flowing data.
To me, it almost seems like ESBs are becoming obsolete. Things like Kafka are taking them over. That’s becoming the modern ESB.
It’s amazing how quickly things change. We had data integration being taken over somewhat by ESB, or at least we thought it would.
By the SOA concept.
Yeah, by the SOA concept. And now we have Kafka. It’s taking over. It’s interesting how things change.
Well, they’re changing, but at the same time, we went from point-to-point data integration to the ESB SOA concept for the reason that point-to-point integration that grows too big becomes a giant, unwieldy spider web. And now we’re trying to do that same point-to-point craziness with massive, distributed data volumes, and hundreds of sources on top of that …
And you don’t want to move the huge volumes of data, either.
No, you have to have a layer that separates the producers, however many hundreds of places the data is coming from, from whoever wants to consume that data.
Yeah, absolutely. It’s interesting how it has changed.
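The spider-web point can be made concrete with a small sketch. It counts connections for direct point-to-point wiring versus a shared layer between producers and consumers, and includes a toy in-memory broker. All names here are hypothetical and this is not Kafka's API; in particular, the toy pushes messages to handlers, whereas Kafka consumers pull from the log.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


def point_to_point_links(producers: int, consumers: int) -> int:
    """Every producer wired directly to every consumer."""
    return producers * consumers


def brokered_links(producers: int, consumers: int) -> int:
    """Every producer and every consumer holds one connection to the broker."""
    return producers + consumers


class TinyBroker:
    """In-memory stand-in for a message layer (assumption: push delivery)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = (
            defaultdict(list)
        )

    def subscribe(self, topic: str, handler: Callable[[str], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: str) -> int:
        """Deliver the message to each subscriber; return the delivery count."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


if __name__ == "__main__":
    print(point_to_point_links(200, 10), "direct links")
    print(brokered_links(200, 10), "links through a broker")

    broker = TinyBroker()
    received: List[str] = []
    broker.subscribe("clicks", received.append)
    broker.publish("clicks", "page=/home")  # the producer never sees the consumer
    print(received)
```

With 200 sources and 10 consumers, direct wiring needs 2,000 links while the brokered layout needs 210, and producers can be added without touching any consumer.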
So, we talked about Kafka and Spark.
Those seem to be the two hottest technologies right now, Spark and Kafka. People are adopting those. There’s pretty strong hope for the future for both of those technologies.
Is there something else you’d like to put in a word for?
If you have data stored in Hadoop and you’re unhappy with the performance, the level of analytics that you can run on it, or the stability of the platform, come check out a free trial of HPE Vertica for SQL on Hadoop. You can try it and run it, and I think you’ll be happy with its performance.