From Batch to Real-Time: Achieving Hadoop Velocity via Apache Storm
When It Has to be NOW
What is called big data stream processing today is not a new concept for telecommunications providers, a sector that has managed both the volume and velocity of streaming big data for years.
Every day, billions of wireless phone calls are placed, and each must be connected in near-real time. Few callers would wait for call requests to be queued in 30-minute batches. While most people are unaware of it, plenty of real-time data processing happens behind the scenes. Before a call can commence, the phone must be identified as a device belonging to a legitimate account in good standing with a specific carrier, and the network must determine whether the call should be treated as “roaming.” All of this must happen in a few seconds, despite a number of intermediate steps. It should come as no surprise that telecoms have been at the forefront of real-time processing for decades.
Real-time data streams have more recently made their presence known in social networks. For example, specialists refer to the Twitter data stream as “the Twitter firehose,” and users of Facebook and LinkedIn expect their Likes and Follows to be acknowledged and processed almost instantly.
New real-time data streams are also drawing increased attention in applications as diverse as traffic sensors, engine performance monitoring, computer security and remote patient monitoring systems.
One way to address the increasing demand for real-time applications is to integrate Apache Storm, a powerful, open-source computational engine for real-time Big Data applications on Hadoop, with Kafka. Even so, this combination can be difficult to use, even for developers. The recently announced integration of Syncsort’s award-winning DMX-h data integration software with Impetus Technologies’ StreamAnalytix platform is designed to dramatically simplify the creation of real-time analytics applications, including the cleansing, pre-processing and transformation of data in motion.
To Stream or to Batch Data Is the Question (Credit: Paul Stevenson | Flickr)
When Later is OK
As appealing as streamed data may be, sometimes “later” is perfectly acceptable. Many business uses do not require immediate processing. For most organizations, accounts payable, accounts receivable and payroll are handled as batch processes: data is collected into a staging area, then processed as a group, perhaps weekly or even monthly.
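The stage-then-process pattern can be sketched in a few lines of plain Python. This is an illustrative example only, not taken from any Syncsort product; all names (`stage`, `run_batch`, the record fields) are hypothetical.

```python
# Illustrative sketch of staged batch processing: records accumulate in a
# staging area and are processed together on a schedule, the way payroll or
# accounts-payable runs work. All names here are hypothetical.
staging_area = []

def stage(record):
    """Collect a record into the staging area for later processing."""
    staging_area.append(record)

def run_batch():
    """Process everything staged so far as one group, then clear the area."""
    total = sum(r["amount"] for r in staging_area)
    count = len(staging_area)
    staging_area.clear()
    return {"records": count, "total": total}

# Records trickle in during the day...
stage({"invoice": "A-1", "amount": 120.0})
stage({"invoice": "A-2", "amount": 80.0})

# ...and the nightly (or weekly) job processes them all at once.
summary = run_batch()
print(summary)  # {'records': 2, 'total': 200.0}
```

The trade-off is latency: nothing staged is visible downstream until the batch runs, which is exactly why the real-time use cases above cannot be served this way.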
In the screenshot shown below, wireless provider Ptel offers its customers online call records going back only 45 days, and it does not push real-time calls to this dashboard. Older records are likely archived off to another system, and new calls are pushed into the dashboard, both probably in a “nightly batch.”
Sample Call Record Detail from a Wireless Provider
Staged, batch processing is not limited to these traditional accounting applications. A study by HP Laboratories looked at utility smart metering systems involving 22 trillion smart meter readings. That qualifies as Big Data. The study authors claim that utility customers are, for now, OK with “batch uploads and analysis of data,” though this HP team is also looking at streaming sensor data as a separate use case. (The investigation employed HP Vertica, which is a Syncsort Technology Alliance Partner.)
Hadoop Now and Hadoop Later
It seems unlikely that stream processing will ever replace all Hadoop loading chores. There are clear instances when both types of processing are needed. For instance, Facebook, Twitter and LinkedIn offer analytics dashboards to select account holders. At first blush, such dashboards present structured data as might originate in a relational database like Oracle or SQL Server, but that data could well have originated as log records in Hadoop.
Such dashboards have typically presented results from batch processing, though some are beginning to provide real time insights. As the need for speed increases, so does the need to scale up, and this is where Hadoop can excel.
One measure of stream processing’s growing importance is the adoption of Apache Storm, which not only works as a Big Data processing framework for real-time applications but also allows for “batch, distributed processing” of streamed data. Storm can process over a million records per second per node on a typical cluster, and data processed with Storm can then be passed on to Hadoop. Storm is used today by organizations as far-ranging as WebMD, Spotify, Cerner, Groupon, Yelp, Twitter and The Weather Channel. These uses include ETL, a sweet spot for Syncsort products, as WebMD notes: “We also use Storm for internal data pipelines to do ETL and for our internal marketing platform where time and freshness are essential.”
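Storm structures a real-time application as a topology of spouts (stream sources) and bolts (processing steps). The sketch below mimics that dataflow in plain Python to show the idea; it is a conceptual illustration, not Storm API code, and the log format and function names are hypothetical.

```python
# Conceptual sketch of Storm's spout -> bolt dataflow in plain Python.
# This is NOT the Storm API; it only illustrates the model: a spout emits
# tuples one at a time, and downstream bolts transform or aggregate them.

def log_spout(lines):
    """Spout: emit one tuple per raw log line as it 'arrives'."""
    for line in lines:
        yield line

def parse_bolt(stream):
    """Bolt: split each log line into (user, action) fields (a light ETL step)."""
    for line in stream:
        user, action = line.split(",")
        yield user.strip(), action.strip()

def count_bolt(stream):
    """Bolt: keep a running count of actions per user."""
    counts = {}
    for user, _action in stream:
        counts[user] = counts.get(user, 0) + 1
    return counts

raw = ["alice, click", "bob, view", "alice, click"]
result = count_bolt(parse_bolt(log_spout(raw)))
print(result)  # {'alice': 2, 'bob': 1}
```

In real Storm, each bolt runs in parallel across the cluster, which is how the per-node throughput cited above scales out; the aggregated results can then be handed off to Hadoop for deeper batch analysis.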
Judiciously combined stream and batch processing into Hadoop is likely to become part of even more complex systems still on the drawing board.