The Big Data Processing Dilemma: Combining Streaming and Batch Data Sources
A lot of data management is focused on organizing, storing, moving and retrieving batch data: data handled in big bunches all at once. More and more, however, data that streams in as a constant flow of small bits is gaining importance. From website actions to online transactions to all that data from the Internet of Things, modern ETL has a whole new set of challenges.
An entirely new way to buffer, transport and process data has been invented to handle this style of data, the kind that puts the Velocity in the 3 V’s. The dilemma for every business is this:
Handling streaming data requires new technology and methods. However, to get the best business value from streaming data, it has to be integrated with the old batch data sources, which requires old technology and methods.
Apache Kafka isn’t much use for querying a relational database to check historical account information. Nor is an RDBMS or Hive well designed to process a thousand online transactions per second. But you need both the online transactions and the account information together to identify a potentially fraudulent transaction.
In order to get business value, you need to be able to handle both fast data and big data, and make them work together.
Let’s dig into that fraud prevention example a little deeper.
Imagine you work at a big, international bank. Fraud is a billion-dollar problem that you can’t afford to ignore. Thousands of online transactions are flowing in. You need to evaluate each one within seconds to determine whether it is fraudulent, and then either approve it or take some other action.
With your old batch systems, you can check things like where that customer lives and what their normal historical spending patterns are. With your new streaming data processing systems, you can process the incoming stream of transactions. But those two data flows are completely separate. Chances are, those two jobs weren’t even built in the same programming language.
Yet neither set of data alone will tell you if any given transaction is fraudulent or not.
You need to pull in the streaming transaction data, do a fast lookup on the customer data, and join the two data sets. Based on that combined data, you should be able to tell whether a transaction might be fraudulent or is normal. To act on the result, the “yes” transactions need to be pushed to a streaming queue that triggers something immediate, like sending an SMS message to the customer to confirm the transaction. The “no” transactions need to be pushed to your data storage system, like a Hive table or an RDBMS, where they become a batch data set for long-term trend tracking.
To do the job right, you’ve got to combine both streaming and batch data sources coming in, and also write back out both streaming and batch data.
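To make that flow concrete, here is a minimal sketch in Python of the join-and-route step described above. Everything in it is a stand-in invented for illustration: an in-memory dictionary plays the part of the batch customer data (the Hive table or RDBMS), a plain list plays the part of the incoming stream (a Kafka topic, say), and two lists play the parts of the alert queue and the long-term store. The names Transaction, CustomerProfile, looks_suspicious and route, and the two simple rules used to flag a transaction, are assumptions rather than anything a real fraud system or product prescribes.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass
class Transaction:
    """One event from the (hypothetical) streaming side."""
    customer_id: str
    amount: float
    country: str


@dataclass
class CustomerProfile:
    """One row from the (hypothetical) batch side."""
    home_country: str
    typical_max_amount: float


# Stand-in for the historical customer data held in a batch system.
PROFILES: Dict[str, CustomerProfile] = {
    "c1": CustomerProfile(home_country="US", typical_max_amount=500.0),
    "c2": CustomerProfile(home_country="DE", typical_max_amount=2000.0),
}

# Stand-in for the incoming stream of transactions.
SAMPLE_STREAM: List[Transaction] = [
    Transaction("c1", 120.0, "FR"),  # country differs from the profile
    Transaction("c2", 80.0, "DE"),   # matches the profile
    Transaction("c9", 10.0, "US"),   # no profile on record
]


def looks_suspicious(txn: Transaction, profile: Optional[CustomerProfile]) -> bool:
    """Illustrative rules only: flag unknown customers, unusual
    countries, and amounts above the customer's typical maximum."""
    if profile is None:
        return True
    return (
        txn.country != profile.home_country
        or txn.amount > profile.typical_max_amount
    )


def route(
    stream: Iterable[Transaction],
    profiles: Dict[str, CustomerProfile],
) -> Tuple[List[Transaction], List[Transaction]]:
    """Join each streaming transaction with its batch profile, then
    split the results into an alert list and a history list."""
    alerts: List[Transaction] = []   # stand-in for the streaming alert queue
    history: List[Transaction] = []  # stand-in for the batch data store
    for txn in stream:
        profile = profiles.get(txn.customer_id)  # the fast lookup
        if looks_suspicious(txn, profile):
            alerts.append(txn)
        else:
            history.append(txn)
    return alerts, history


if __name__ == "__main__":
    alerts, history = route(SAMPLE_STREAM, PROFILES)
    print("alert queue:", alerts)
    print("batch store:", history)
```

In a real deployment the lookup would hit the batch system or a cached copy of it, and the two output lists would be a streaming queue and a table. The shape of the job stays the same: one streaming input, one batch input, one streaming output and one batch output.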
Hadoop Summit is this week, which is always a crazy time. But it has us all at Syncsort especially excited and busy this year. By now, most folks know that Hortonworks is re-selling Syncsort as the ETL on-boarding solution for HDP. I’ll be the one to introduce Scott Gnau, the CTO of Hortonworks, and my boss, Tendu Yogurtcu, the GM of Big Data at Syncsort. Scott Gnau will be talking about why HDP plus an ETL on-boarding solution like Syncsort’s makes sense in modern data architectures. Tendu Yogurtcu will co-present with him on some specific customer use cases of Syncsort and HDP.
If you attend on Tuesday at 4:10 in room 230C, you might notice one thing that all the use cases have in common: Streaming and batch working together. Syncsort is confidently planting a flag as the bridge between the old and the new in data management.
In addition, Confluent, the folks behind Kafka, will be at the Syncsort booth (#1303) during most of the conference, ready to talk more about streaming data patterns, and the Kafka stack. I’ll be there with them on Wednesday during the morning and afternoon breaks at 11:00 AM and 3:40 PM to explain how Syncsort, Spark, HDP and Kafka all work together to solve the streaming data dilemma.