If you’re still working towards getting your act together, no worries. One of the benefits of being a Big Data late bloomer is that you won’t have to wade through all of the platforms and products that came and went during the nebulous early years.
In the beginning, there was Hadoop. Then came a plethora of various products, some of which blossomed and became mature parts of the Hadoop ecosystem. Others petered out or are still puttering around wondering what they’re going to be when they grow up.
At the cusp of 2017, there are still quite a number of Big Data products and platforms to pick from to assemble an infrastructure that meets your needs, but it’s a little clearer what these pimply adolescents will eventually look like when they graduate high school. We’ve lined up the sexiest prospects for Big Data prom king and queen that you need to consider for ramping up your projects for the coming year.
Which Big Data products have proven their worth? Which have been tried by fire in the arena of production, and emerged ready for anything?
In the category of data processing frameworks, Apache Spark is winning the popularity contests. Spark Streaming is Spark’s entry in the streaming Big Data processing arena. It is widely adopted for “real-time processing,” but we all know that “real-time” can mean different things to different people. Spark Streaming processes streaming data in micro batches, by dividing streams at predetermined intervals. This does cause some latency, but it also has the advantage of making sure that data is processed only a single time and processed reliably. Since Spark Streaming integrates so beautifully with batch Spark processing, it makes development as close to a breeze as it can be. Many organizations are interested in using a single software environment for streaming and batch processing, so developers accustomed to building batch jobs in Spark find Spark Streaming a natural choice.
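To make the micro-batch idea concrete, here is a minimal plain-Python sketch. It does not use the Spark API; the `micro_batches` function, the `(timestamp, value)` event shape, and the one-second interval are all assumptions made for illustration.

```python
from collections import defaultdict


def micro_batches(events, interval):
    """Group (timestamp, value) events into batches by fixed time interval.

    Hypothetical helper for illustration only; not part of Spark.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        # Integer division assigns each event to the interval that contains it.
        batches[int(timestamp // interval)].append(value)
    # Return the non-empty batches in time order; each is handled as one unit.
    return [batches[key] for key in sorted(batches)]


# Five events spread over roughly three seconds, cut into 1-second batches.
events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (2.7, "d"), (2.9, "e")]
batches = micro_batches(events, interval=1.0)
print(batches)
```

The latency trade-off is visible here: event "a" cannot be handled until its whole interval has closed, so a longer interval means a longer wait.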
Apache Storm is a true real-time streaming Big Data processing framework. It handles a stream one event at a time, instead of as a string of little batches. Hence, Storm achieves impressively low latency, and is ideal for data that needs to be processed as a single entity. When it comes to production, Storm is the senior member of this team of data streaming platforms, and has a nice collection of commercial support to prove it. It also has a very active community keeping it constantly updated. Storm has two downsides: it lacks native integration with YARN, and there’s no assurance that data will only be processed a single time. It does, however, run on Mesos or on YARN by way of Apache Slider. Many of the more mature streaming data processing applications are powered by Storm.
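The "no assurance that data will only be processed a single time" caveat is easier to see in a sketch. The plain-Python example below is not Storm code; `process_at_least_once`, `flaky_handler`, and the retry limit are hypothetical names chosen to illustrate how replaying an unacknowledged event can produce a duplicate.

```python
def process_at_least_once(events, handler, max_attempts=3):
    """Deliver events one at a time, replaying any that are not acknowledged.

    Hypothetical helper for illustration only; not part of Storm.
    """
    deliveries = []
    for event in events:
        for attempt in range(1, max_attempts + 1):
            # Record every delivery, including replays of the same event.
            deliveries.append(event)
            if handler(event, attempt):
                break  # acknowledged, move on to the next event
    return deliveries


def flaky_handler(event, attempt):
    # Assumption: the acknowledgement for "b" is lost on its first attempt,
    # so the event gets replayed even though it may already have been handled.
    return not (event == "b" and attempt == 1)


deliveries = process_at_least_once(["a", "b", "c"], flaky_handler)
print(deliveries)
```

Every event arrives at least once, but "b" arrives twice, so downstream logic needs to tolerate or de-duplicate repeats.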
In addition to a processing framework, you need a message passing layer, a way to access and move streaming data. Apache Flume is one of the more mature of these options. For years, Flume has been a hot commodity for streaming ingestion. It’s well-grounded in the overall Hadoop ecosystem, and scores support from all the commercial Hadoop distributions. That’s often enough to get Flume a front row seat in enterprise-size environments. Unlike many of us, Flume’s age has become an attractive quality. It’s remained fresh as new Hadoop products came into play. Its primary disadvantage is that, like most seniors, it tends to lose things from time to time, since it does not feature event replication.
These three are among the most mature and well-proven data streaming projects, but are certainly not your only options. Some of the new kids in school are rising stars in streaming Big Data processing.
What new tools and platforms are lookin’ good as the Hadoop ecosystem matures?
Promising Freshman Class
Apache Flink is a new streaming data processing platform. It has many of the advantages of both Storm and Spark. Spark is a batch processor without true streaming capabilities, and Storm is a streaming solution that doesn’t offer any batch capabilities. Flink lets you do both by assuming that all data is a stream, and that batch data is simply a bounded section of a stream. Flink also often out-performs both of these seniors, leaving Spark Streaming in the dust, and pulling ahead of Storm as well.
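The "batch is just a section of a stream" idea can be sketched in a few lines of plain Python. This is not the Flink API; `running_totals` is a hypothetical function, and the point is only that one piece of logic can serve both a bounded input and an unbounded one.

```python
import itertools


def running_totals(stream):
    """Yield a running sum for each value as it arrives.

    Hypothetical helper for illustration only; not part of Flink.
    """
    total = 0
    for value in stream:
        total += value
        yield total


# "Batch": a bounded stream, consumed to the end.
batch_result = list(running_totals([1, 2, 3, 4]))

# "Streaming": an unbounded source (1, 2, 3, ...); results are produced as
# values arrive, and here we simply stop after the first four.
unbounded_source = itertools.count(1)
stream_result = list(itertools.islice(running_totals(unbounded_source), 4))

print(batch_result, stream_result)
```

Because the function never assumes the input ends, the batch case needs no separate code path.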
Apache Apex and Flink have a lot in common. Both are freshmen that out-perform their more mature classmates. Both treat all data as streaming, and allow a simple interval to be set to switch from streaming data processing to batch. This ability to define both batch and streaming jobs in a single processing framework is proving to be essential as the Hadoop ecosystem continues to expand.
Apache Kafka is a newbie in the streaming message service category, and it is rapidly gaining momentum for its scalability and fault tolerance. Kafka is showing up in more and more businesses as a data transportation backbone. Newer versions are adding capabilities for exactly-once processing. It’s not just a replacement for Flume. It actually cooperates with Flume, allowing Flume to ingest streaming data into a Kafka topic. It also can accept data from many other sources, both batch and streaming. By following the successful strategy that Flink and Apex use of treating all data as a stream first, Kafka makes serving up both streaming and batch data in a single job far simpler.
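As a rough picture of how one topic can serve both batch and streaming readers, here is a toy in-memory stand-in written in plain Python. It is not the Kafka client API; the `ToyTopic` class and the record names are hypothetical, and the only thing borrowed from Kafka is the general idea of an append-only log that readers consume by offset.

```python
class ToyTopic:
    """A toy append-only log; a hypothetical stand-in, not a Kafka client."""

    def __init__(self):
        self._log = []

    def append(self, record):
        # Records are only ever added at the end; return the new offset.
        self._log.append(record)
        return len(self._log) - 1

    def read(self, offset, max_records=None):
        # Readers track their own offset, so they can read at their own pace.
        end = len(self._log) if max_records is None else offset + max_records
        return self._log[offset:end]


topic = ToyTopic()
topic.append("click-1")                # a record from a streaming source
for record in ["order-1", "order-2"]:  # records loaded from a batch source
    topic.append(record)

# Batch-style reader: take everything currently in the log at once.
batch_view = topic.read(0)

# Streaming-style reader: poll one record at a time, advancing an offset.
offset, streamed = 0, []
while True:
    records = topic.read(offset, max_records=1)
    if not records:
        break
    streamed.extend(records)
    offset += len(records)

print(batch_view, streamed)
```

Both readers see the same records in the same order; they differ only in how much they pull per read.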
One clear message from all of these promising young projects is that a key to successful Big Data initiatives is to support both batch processing and streaming data use cases in a single development environment. This is true of your high-performance integration tools as well. Syncsort DMX-h supports both batch processing and streaming data processing. It eliminates many of the complexities that IT organizations face when working to provide access to all enterprise data, including mainframe. Syncsort simplifies integration for a fast path from data to business insights.
Read Syncsort’s report, 2018 Big Data Trends: Liberate, Integrate & Trust, to see what every business needs to know in the upcoming year about Big Data, including 5 key trends to watch!