What is Hadoop, really? What are its advantages and disadvantages? Where is the future of big data and data analytics headed?
Robin Bloor (@robinbloor), chief analyst for The Bloor Group, discusses these issues and more with Syncsort’s Paige Roberts. The Bloor Group is an independent research and news media analysis firm that specializes in assessing and analyzing enterprise-level software solutions, technologies, services, and markets. The firm is based in Austin, Texas.
What do you think Hadoop is for?
Robin Bloor: You could say Hadoop is an operating system, but it’s not like any operating system that has existed before. And even if it were, it’s a long way short of being mature. Hadoop is an operating environment for data.
What about some particular uses of Hadoop?
Bloor: Well, active archive is almost a no-brainer compared to previous choices, especially if you’ve got something expensive like a mainframe or Teradata or a big Oracle warehouse. Using Hadoop for active archive, it’s almost like, don’t even bother working out what the other options are. Just do it.
Roberts: Don’t even think about it. It’s that obvious. That’s one of the reasons Syncsort came up with the mainframe distributable format. If you can archive without even changing your data, and still get to it, then like you said, it’s so far ahead of every other option out there, it’s a no-brainer.
Bloor: Right. The other applications for Hadoop center around analytics. The analytics applications on Hadoop are interesting because you’ve got this mix now of near real-time and visual analytics. And visual analytics are all about knowledge discovery of one sort or another.
What do you think about the data lake/data hub concept?
This again is an immature thing but the concept of the data lake really is the idea of having a common point for the governance of all data – and everything that is involved in governance. So it’s data life cycle, it’s provenance and lineage, it’s data cleansing, metadata capture, it’s all of those things.
What are the advantages and disadvantages?
Well, if you look at what we used to have in the data warehouse world, you have the set of OLTP [transactional] applications that are recording the activities of the business, and you pipe that data into a data warehouse in order to get feedback on the activities of the business. With the data lake, it now applies to all data. It’s gone beyond the OLTP systems. It applies to data outside your organization as well as within the organization. So it’s much broader in terms of its area of action. And it now deals in an awful lot of data that wasn’t formally stored before. It’s called unstructured data, but it’s basically data in any form, rather than just well-structured database data.
What do you think is the most expensive IT or data management activity?
I actually think it’s going to vary from context to context, so there is going to be more than one answer to this question, but in the general sense, I think the governing of data to get it into a properly usable form is the greatest area of expense. A usable form would be something we could run a business application against: BI, analytics, whatever you are going to put on top of that. The expense is in getting it to that state from the point of ingest, whatever that was and wherever it came from.
What do you think of Spark?
I think it’s immature. I think because it has acquired lots of momentum, it may become a fulcrum of how streaming is done, but then again it may not. Its relevance is in two parts: One is to deal with streaming data and the other is to actually bring down latencies for things that need to be parallelized. The old architecture depended on stored data, on SSD or on spinning disks. It is superseded by a memory-based capability.
What streaming data processing technology looks the most exciting to you?
It’s nearly all immature. The products that existed to deal with streaming in terms of CEP [Complex Event Processing] technology are now, in one way or another, being evolved into the Hadoop environment. The most exciting stuff is actually the proprietary stuff. Things like Striim. Striim is one of the most exciting things because it’s actually delivering applications off the data stream. That’s more interesting than a platform that can merely deal with streaming, because such a platform hasn’t yet been proven out by applications.
There is interesting development with Flink. There is interesting development with Apex as well, in that platform area. Then there are some interesting streaming products, and Striim is one of those, but there are others.
What do you think of Kafka?
I think it’s going to be the fulcrum of an awful lot of activity. There is a necessity to have a very distributable message management environment that replaces the old Enterprise Service Buses, and Kafka is it. And it’s probably not going to be dethroned.
I agree with that. So, Spark is the darling of the big data world right now. What do you think the next big thing is going to be?
It depends on the area of application. One will be Kafka. It’s going to become more and more important, because it deals with the distribution problem. Therefore, the importance of Kafka will rise up above perhaps even the importance of Spark or whatever replaces Spark, if it gets replaced.
But then there’s the next generation stuff. The next generation stuff is the blockchain [database] and everything that goes with that. But that’s five years away from becoming really important. Most people don’t see the blockchain yet, but they will.
Tell me about the blockchain.
It’s the technology around bitcoin. The reason most people won’t look at it in the big data world right now is that it is actually transactional technology. The advantages of the blockchain that make it important are twofold: The first advantage is that its security is bulletproof. It’s unbreakable security. It’s a write-only system, which is actually the correct way to handle data. Data should never be updated. It should always be written. The other reason that the blockchain is interesting is because it distributes the responsibility for transactional integrity as wide as you want it, basically as wide as it could be. The blockchain goes everywhere.
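The "written, never updated" idea can be illustrated with a minimal sketch of an append-only, hash-chained log. This is a toy under stated assumptions, not how bitcoin or any particular blockchain is implemented: the class name `AppendOnlyLedger` and its methods are hypothetical, it runs on a single machine, and it shows only tamper evidence, not the distributed agreement between participants that Bloor describes.

```python
import hashlib
import json


class AppendOnlyLedger:
    """Toy append-only log (hypothetical name); each entry commits to the previous one."""

    GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry

    def __init__(self):
        self._entries = []

    @staticmethod
    def _digest(prev_hash, payload):
        # Hash the previous entry's hash together with this entry's payload.
        body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
        return hashlib.sha256(body.encode("utf-8")).hexdigest()

    def append(self, payload):
        """Record a new fact; existing entries are never modified."""
        prev_hash = self._entries[-1]["hash"] if self._entries else self.GENESIS_HASH
        entry = {
            "prev": prev_hash,
            "payload": payload,
            "hash": self._digest(prev_hash, payload),
        }
        self._entries.append(entry)
        return entry["hash"]

    def verify(self):
        """Return True only if every entry still matches its recorded hash chain."""
        prev_hash = self.GENESIS_HASH
        for entry in self._entries:
            if entry["prev"] != prev_hash:
                return False
            if entry["hash"] != self._digest(entry["prev"], entry["payload"]):
                return False
            prev_hash = entry["hash"]
        return True


ledger = AppendOnlyLedger()
ledger.append({"account": "A", "balance": 100})
# A change is recorded as a new entry rather than an update to the old one.
ledger.append({"account": "A", "balance": 90})
chain_is_valid = ledger.verify()
```

In this sketch, editing an earlier entry in place changes its recomputed hash, so `verify()` fails; that is the sense in which an append-only chain makes after-the-fact updates detectable.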
Bloor: No one else will answer that question that way. I can guarantee it.
Roberts: I am pretty sure you’re right. I’ll have to learn more about blockchain. So, one more question. What are you currently researching that’s interesting?
I think the most interesting thing that’s happening right now, because I think it will change a lot of things – it’s difficult to see what the changes will all be yet – is the changes in hardware that are happening at the moment. And there are a lot of them. This is not just one thing. It’s many things. They’re going to change some of the fundamental parameters of the way we do things.
You’ve got this 3D memory stuff that’s coming out of Intel. We’re looking at various technologies that are actively using FPGA [Field Programmable Gate Array chips]. The marriage of GPU and CPU is coming. There’s the battle – I’ve no idea how it resolves – between the ARM chip and the Intel chip. The ARM has been winning, and if it gets too powerful then Intel will be dethroned. You’ve got system-on-a-chip (SOC), which means that fundamentally you will be designing a different kind of hardware environment at the chip level. You take all of that together, and you realize that all those things are independent. You’re looking at possibly the emergence of new hardware platforms.
We have a hardware platform right now that is fundamentally CPU, memory, and storage. You could get hardware platforms that are possibly far more versatile than that. A node consisting of CPU, memory, and storage is a particular kind of component in an architecture. If you actually have chains of system-on-a-chip, for instance, you have environments where the ability to distribute data and distribute processing is way different than it is right now. That will all have a radical impact.
And if parallelism is important, then you really have to work out how you’re going to use CPU, which is fundamentally serial processing, GPU, which is fundamentally parallel processing, and FPGA, which is fundamentally logic on the iron. All of those, how they balance, I have no idea. But those are the most radical things that are happening, because they will change everything above them.
Chips are the foundation of everything in data processing. If they change, everything changes. A few years back, I saw rumors that they were going to replace all the chips in data centers with ARM chips eventually. From a green point of view, that would be wonderful because of the lower energy requirements. But from a software point of view, that would be hugely disruptive. Every bit of software that has ever been written is going to be wrong, and have to be redone.
It might not be wrong actually. Linux runs on the ARM chips, and everything runs on Linux. So you could put it all on there.
But you actually have a different balance of things. An ARM chip has a different processing capability from an x86 chip. So, you’re going to have different arrangements. Yes, you can port the software, but the software behavior, the latencies and things like that, will have different patterns. It’s not going to be that easy.
And the more tightly coupled the software is to the chip behavior, the more it expects x86-style processing, the more it will be off. Anything that uses vectorization to match chip cache, for instance, is not going to work.
Yes, but the work that’s going on in the area of Software Defined Networks ultimately resolves the problem, when you get sophisticated enough, because it knows what resources are there and deploys on the resources according to what the workload is.
Sort of like a giant optimizer?
Well, that’s a very simple statement of something that’s not very simple to do. But as that gets more sophisticated, then the problem of “hard” and “soft” changes. “Where the rubber meets the road,” is how I think of it. Once you have that, you can have the hardware layer be as clever as you want, and have Software Defined Networking work out how to use it.
So that’s how I think it will happen.
We’ll see in a few years if you’re right.
There’s always the possibility that that’s completely wrong and something else will happen.
[laughter] That’s the trouble with looking into a technological crystal ball.
In addition to Robin’s excitement about the possibility of new hardware platforms, you can hear him talk about the cool new frameworks and data sources out there, and the promise they hold, including Spark and MapReduce, Kafka and NiFi, and how organizations can maximize the potential of their legacy systems as well as their data lake by fusing legacy and modern architectures, in a Bloor Group “Hot Technologies of 2016” webcast.