Big Data 101: Dummy’s Guide to Batch vs. Streaming Data

Are you trying to understand Big Data and data analytics, but are confused by the difference between stream processing and batch data processing? If so, this article’s for you!

Batch Processing vs. Stream Processing

The difference between batch processing and stream processing is one of the most fundamental concepts in the Big Data world.

There is no official definition of these two terms, but when most people use them, they mean the following:

  • Under the batch processing model, a set of data is collected over time, then fed into an analytics system. In other words, you collect a batch of information, then send it in for processing.
  • Under the streaming model, data is fed into analytics tools piece by piece, as it is generated. The processing is usually done in real time.
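To make the two definitions concrete, here is a minimal Python sketch that computes the same average both ways. The function names and the sample readings are made up for illustration; a real system would use an analytics platform rather than plain Python.

```python
from typing import Iterable, Iterator, List


def batch_average(readings: List[float]) -> float:
    """Batch model: the full data set is collected first, then processed once."""
    return sum(readings) / len(readings)


def streaming_average(readings: Iterable[float]) -> Iterator[float]:
    """Streaming model: the result is updated as each record arrives."""
    count = 0
    total = 0.0
    for value in readings:
        count += 1
        total += value
        # An up-to-date answer is available after every single record.
        yield total / count


# Hypothetical sample data, used only for illustration.
collected = [10.0, 20.0, 30.0, 40.0]

batch_result = batch_average(collected)
stream_results = list(streaming_average(iter(collected)))

print(batch_result)    # one answer, after all the data is in
print(stream_results)  # a running answer, one per incoming record
```

The batch version produces a single result once everything has been collected, while the streaming version produces a fresh result for every record it sees.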

Those are the basic definitions. To illustrate the concept better, let’s look at the reasons why you’d use batch processing or streaming, and examples of use cases for each one.

Batch Processing Purposes and Use Cases

Batch processing is most often used when dealing with very large amounts of data, and/or when data sources are legacy systems that are not capable of delivering data in streams.

Data generated on mainframes is a good example of data that, by default, is processed in batch form. Accessing and integrating mainframe data into modern analytics environments takes time, which in most cases makes it impractical to turn it into streaming data.

Batch processing works well when you don’t need real-time analytics results, and when processing large volumes of information matters more than getting results quickly. That said, batch processing is not a strict requirement for working with large amounts of data; data streams can involve “big” data, too.

Stream Processing Purposes and Use Cases

Stream processing is key if you want analytics results in real time. By building data streams, you can feed data into analytics tools as soon as it is generated and get near-instant analytics results using platforms like Spark Streaming.

Stream processing is useful for tasks like fraud detection. If you stream-process transaction data, you can detect anomalies that signal fraud in real time, then stop fraudulent transactions before they are completed.
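As a rough illustration of the idea, the sketch below checks each transaction as it arrives and flags any amount that is far above the account's running average. The record shape, the `factor` and `min_history` parameters, and the rule itself are assumptions chosen for this example; real fraud detection relies on much richer models.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical record shape: (account_id, amount).
Transaction = Tuple[str, float]


def flag_suspicious(
    transactions: Iterable[Transaction],
    factor: float = 5.0,
    min_history: int = 3,
) -> Iterator[Transaction]:
    """Yield transactions whose amount exceeds `factor` times the
    account's running average, evaluated one record at a time."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for account, amount in transactions:
        seen = counts[account]
        if seen >= min_history and amount > factor * (totals[account] / seen):
            # Flag immediately, and keep the outlier out of the baseline.
            yield (account, amount)
            continue
        totals[account] += amount
        counts[account] += 1


# Hypothetical sample stream, used only for illustration.
stream = [
    ("acct-1", 20.0),
    ("acct-1", 25.0),
    ("acct-1", 15.0),
    ("acct-1", 900.0),  # far above this account's usual amounts
    ("acct-2", 900.0),  # no history yet, so nothing to compare against
    ("acct-1", 22.0),
]

flagged = list(flag_suspicious(stream))
print(flagged)
```

Because the check runs on each record as it arrives, a suspicious transaction can be caught before the next one is processed, instead of hours later in a nightly batch job.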

Turning Batch Data into Streaming Data

As noted, the nature of your data sources plays a big role in determining whether the data is suited for batch or stream processing.

That doesn’t mean, however, that there’s nothing you can do to turn batch data into streaming data to take advantage of real-time analytics. If you’re working with legacy data sources like mainframes, you can use a tool like DMX-h to automate the data access and integration process and turn your mainframe batch data into streaming data.

This can be very useful because, by setting up streaming, you can do things with your data that would not be possible with batch processing. You can obtain faster results and react to problems or opportunities before it is too late to act on them. Watch our video Real Time Streaming with Kafka and Syncsort DMX-h for more information.
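The general idea of exposing batch data as a stream can be sketched in a few lines of Python. This is not how DMX-h works internally, which this article does not describe; it is only a simplified stand-in that reads a hypothetical CSV batch export and hands records downstream one at a time instead of all at once.

```python
import csv
import io
from typing import Dict, Iterator, TextIO


def replay_as_stream(batch_export: TextIO) -> Iterator[Dict[str, str]]:
    """Read a batch export lazily and emit one record at a time,
    so downstream code can start working before the file is fully read."""
    reader = csv.DictReader(batch_export)
    for row in reader:
        yield dict(row)


# Hypothetical batch export, held in memory so the example is self-contained.
batch_export = io.StringIO(
    "account,amount\n"
    "acct-1,20.0\n"
    "acct-2,35.5\n"
)

records = list(replay_as_stream(batch_export))
for record in records:
    print(record)
```

In a production setup, the records would typically be published to a messaging system such as Kafka rather than printed, and the integration tool would handle the mainframe-specific data access.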

Authored by Christopher Tozzi

Christopher Tozzi has written about emerging technologies for a decade. His latest book, For Fun and Profit: A History of the Free and Open Source Software Revolution, is forthcoming with MIT Press in July 2017.