Does Streaming Mean the End of Hadoop and Batch Processing?
Hadoop has become the unapologetic poster child of big data. Though not without its challenges, Hadoop is more or less the default setting for companies looking to get into big data analysis. The Hadoop architecture, of course, is built on batch processing. But some say batch isn’t the future of Hadoop and big data, and that the drive for real-time information is pushing the industry away from batch and toward streaming. So, is batch an endangered species, or is there room for both batch and streaming in the crystal ball?
Does Streaming Have to Mean the End of Batch?
Elephants are great workers, but as it happens, they aren’t so speedy.
Doug Cutting, co-creator of Hadoop and chief architect at Cloudera, maintains that batch was never the point when designing Hadoop; it was simply what worked. In an email to writer Matt Asay of InfoWorld, Cutting stated, “It wasn’t as though Hadoop was architected around batch because we felt batch was best. Rather, batch, MapReduce in particular, was a natural first step because it was relatively easy to implement and provided great value. Before Hadoop, there was no way to store and process petabytes on commodity hardware using open source software. Hadoop’s MapReduce provided folks with a big step in capability.”
In fact, for all of the hoopla surrounding the success of MapReduce, Google, which originated it and was one of its most touted users, has long since abandoned it. MapReduce served as the basis for Hadoop, but Google’s abandonment of MapReduce doesn’t mean Hadoop is headed for extinction.
As Patrick McFadin of DataStax points out (also to Asay of InfoWorld), it can take companies some time to see a return on investment in big data. Businesses are desperately seeking ways to speed up what they get out of the data, but there are numerous steps between investment and return. Batch processing still has a place in Hadoop, but not at the outset. McFadin believes batch can be useful after the fact for running rollups and deeper analytics. The combination of a batch layer with a real-time speed layer is known as the Lambda architecture. But neither Cutting nor McFadin thinks that batch will remain at the core of Hadoop architecture.
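To make the Lambda idea concrete, here is a minimal sketch in plain Python. It is not how Hadoop or any particular product implements it. The event data, the cutoff value, and the names build_batch_view, SpeedLayer, and query are all hypothetical, chosen only to show a batch view and a real-time view being merged at query time.

```python
from collections import Counter

# Hypothetical event log of (timestamp, page) pairs; not from any real system.
events = [
    (1, "home"), (2, "pricing"), (3, "home"),
    (4, "docs"), (5, "home"), (6, "pricing"),
]

def build_batch_view(all_events, cutoff):
    """Batch layer: recompute page counts from scratch over events up to the cutoff."""
    return Counter(page for ts, page in all_events if ts <= cutoff)

class SpeedLayer:
    """Speed layer: incrementally count only events newer than the last batch run."""

    def __init__(self, cutoff):
        self.cutoff = cutoff
        self.view = Counter()

    def ingest(self, ts, page):
        # Events at or before the cutoff are already covered by the batch view.
        if ts > self.cutoff:
            self.view[page] += 1

def query(batch_view, speed_view, page):
    """Serving layer: answer a query by merging the batch and real-time views."""
    return batch_view[page] + speed_view[page]

cutoff = 4  # assumed timestamp of the most recent batch run
batch_view = build_batch_view(events, cutoff)
speed = SpeedLayer(cutoff)
for ts, page in events:
    speed.ingest(ts, page)

print(query(batch_view, speed.view, "home"))
```

The batch view can be thrown away and rebuilt at any time, which is what makes it suitable for the rollups McFadin describes, while the speed layer only has to cover the gap since the last batch run.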
Some See Batch as an Unnecessary Old Relic
Justin Langseth, co-founder and CEO of Zoomdata, doesn’t agree. He feels that Lambda is just an unnecessary tradeoff. In an interview with Asay, he stated, “There is now end-to-end tooling that can handle data from sourcing, to transport, to storage, to analysis and visualization,” adding, “Real-time data is obviously best handled as a stream. But it’s possible to stream historical data as well, just as your DVR can stream Gone with the Wind, or last week’s American Idol to your TV. This distinction is important, as we at Zoomdata believe that analyzing data as a stream adds huge scalability and flexibility benefits, regardless of if the data is real-time or historical.”
Langseth also points out that eliminating batch simplifies data analysis, because there is no need to fret over batch windows, batch failures, and the like.
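The idea of streaming historical data can be illustrated with a small Python sketch. It is an illustration only, not Zoomdata's implementation. The names replay, live_feed, and running_counts, and the sample page names, are hypothetical. The point is that one stream consumer handles stored and newly arriving records in the same way.

```python
import io
from collections import Counter
from itertools import chain
from typing import Iterable, Iterator

def replay(log: io.StringIO) -> Iterator[str]:
    """Stream historical records one at a time from stored data."""
    for line in log:
        yield line.strip()

def live_feed() -> Iterator[str]:
    """Stand-in for a real-time source; a real feed would block waiting for events."""
    yield from ["home", "docs"]

def running_counts(stream: Iterable[str]) -> Iterator[Counter]:
    """Consume any stream and emit an updated snapshot of counts after each record."""
    counts = Counter()
    for page in stream:
        counts[page] += 1
        yield counts.copy()

# Hypothetical stored history, followed by the live feed, as one continuous stream.
historical = io.StringIO("home\npricing\nhome\n")
snapshots = list(running_counts(chain(replay(historical), live_feed())))
final = snapshots[-1]
print(final)
```

Because running_counts only sees an iterator, it has no batch window to schedule or rerun; the same loop covers the replayed history and whatever arrives afterward.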
What’s the Real Deal?
In the end, streaming might just end up as one of several options for handling big data analytics.
It’s difficult to argue against any of these professionals, as each has a proven track record of identifying the future of big data and developing great solutions to meet needs. However, as Cutting points out, Hadoop (and, effectively, big data) isn’t moving away from batch and toward streaming. Instead, streaming is becoming one of the numerous options for using Hadoop and getting great analysis out of big data.
There are several great tools for real-time analysis, including Spark, Storm, and Kafka. For instance, Syncsort has partnered with Impetus Technologies to provide an integrated solution that dramatically simplifies the creation of real-time analytics. Syncsort also introduced a new design approach in a new release of its data integration product suite, DMX-h, that “future-proofs” applications on Hadoop by supporting multiple compute frameworks, including MapReduce and Spark. As time goes on, it’s likely that the boom in new products will subside and we’ll gradually migrate to the best tools and hone those, while allowing those that didn’t prove so useful to fall by the wayside. It’s hard to see how streaming could push out batch completely, as batch is extraordinarily useful for large-scale analysis of very large data sets.