Eeny Meeny Miny Mo … To Which Hadoop Option Should I Go?
There are now a number of different Hadoop engines to choose from, and for those just getting into analytics (or those looking to replace an option that isn't meeting their needs), it can be daunting just to select the right engine. While MapReduce is slowly being replaced by Apache Spark and Apache Storm, there are still times when MapReduce is your best option. Then you must evaluate the newcomers, like Flink, and determine what, if anything, those engines have to offer your analytics initiatives. How can you choose? Here's the lowdown.
MapReduce was the Hadoop mainstay for a long time, but it's gradually being replaced by easier and speedier options. Still, it's mature and very solid, making it ideal for some jobs.
Typically, when data grows from hundreds of gigabytes into the petabyte range (especially when the data includes semi-structured and unstructured data), it's time to swap out the old RDBMS for MapReduce. While it's slow compared to newer options like Spark and Storm, it is relatively easy to write for and quite scalable. The catch is that its runtime grows roughly in step with the data: double the amount of data and you roughly double the time it takes MapReduce to process a job. And because every job is batch-oriented, MapReduce isn't an option if you need big data for streaming or other real-time processing.
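The map-shuffle-reduce model itself is simple. Here's a minimal pure-Python sketch of the classic word count (illustrative only; this is the processing model, not the Hadoop Java API, and the function names are my own):

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in an input split.
    for word in document.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values for each key (sum the counts).
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big analytics", "big jobs"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
# counts["big"] == 3
```

In real Hadoop, the map and reduce steps run in parallel across the cluster and the shuffle moves data between nodes over the network and disk, which is exactly where the batch-oriented slowness comes from.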
Spark is hailed as running up to 100 times faster than MapReduce when run in memory, and 10 times faster when run on disk. It's also a bit easier to write Spark applications than it is to write for MapReduce, and Spark understands several languages: Java, Scala, Python, and R. Spark also allows you to make the best of all worlds: you can combine SQL, streaming, and high-level analytics in a single application. Spark and Spark Streaming generally deliver higher throughput than Storm, though Storm's tuple-at-a-time model can achieve lower latency. Syncsort recently delivered an open source contribution of an IBM z Systems mainframe connector that makes mainframe data available to the Spark open-source analytics platform. It also announced new integration of the "Intelligent Execution" capabilities of its DMX data integration product suite with Spark, letting users visually design data transformations once and then run them anywhere: across Hadoop, MapReduce, Spark, Linux, Windows, or Unix, on premises or in the cloud.
Storm is another streaming solution, but writing processing operations for Storm is not easy, especially if you're new at it; Storm topologies are typically written in Java or Scala. Storm's strength is speed: it's very fast, making it ideal for tasks that won't tolerate latency, like processing online financial transactions. Storm is also easily scalable, fault-tolerant, resilient, and reliable.
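Storm's low latency comes from its processing model: spouts emit tuples and bolts handle each tuple the moment it arrives, rather than waiting to batch events up. A rough Python sketch of that model (illustrative only, not the real Storm API; the transaction data and threshold are made up):

```python
def transaction_spout():
    # A spout is a source of tuples -- here, fake (account, amount)
    # transactions standing in for a live event stream.
    for txn in [("alice", 120.0), ("bob", 9500.0), ("carol", 40.0)]:
        yield txn

def flag_large_bolt(tuples, threshold=1000.0):
    # A bolt receives one tuple at a time and emits results
    # immediately, so each event is processed as it arrives
    # instead of in periodic batches.
    for account, amount in tuples:
        if amount > threshold:
            yield account, amount

flagged = list(flag_large_bolt(transaction_spout()))
# flagged == [("bob", 9500.0)]
```

In real Storm, the spout and bolt run as separate tasks distributed across the cluster, and the framework handles tuple acking and replay for fault tolerance.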
Apache Tez is built on YARN. It's agnostic about the data types it processes, and while it offers excellent performance, it is not easy to write for; most developers rely on a higher-level framework like Cascading to help with the coding. Tez is pretty easy to deploy, and it can make dynamic decisions about the physical data flow at runtime.
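What Tez actually executes is a directed acyclic graph (DAG) of processing vertices, where each vertex runs only after its upstream inputs are ready. A toy Python sketch of that idea (the vertex names are hypothetical, and this is the scheduling concept, not the Tez Java API):

```python
# Each vertex maps to the list of vertices it depends on.
dag = {
    "extract": [],                  # no upstream dependencies
    "filter":  ["extract"],
    "join":    ["extract"],
    "report":  ["filter", "join"],  # needs both branches
}

def topo_order(dag):
    # Visit each vertex only after all of its inputs have been
    # visited -- the order a DAG engine could run the stages in.
    done, order = set(), []
    def visit(vertex):
        if vertex in done:
            return
        for dep in dag[vertex]:
            visit(dep)
        done.add(vertex)
        order.append(vertex)
    for vertex in dag:
        visit(vertex)
    return order

# topo_order(dag) starts with "extract" and ends with "report"
```

The point of the DAG model is that multi-stage pipelines run as one job, without MapReduce's habit of writing every intermediate result to disk between stages.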
Flink shows a lot of potential. It unifies stream and batch processing with smarter memory management, provides low latency, and isn't too difficult to develop for, generally speaking. The only drawback is that Flink is not yet a mature alternative: until it is more mainstream, you might not get the functionality and support you can get with Spark and Storm today. If it's possible to make a go of it with Spark or Storm, it is probably best to stick with them until Flink has had some time to grow up.