This blog was originally posted on the “Big Data Page by Paige” Blog.
Our director of engineering told me that she had a customer ask if we could do real-time data processing with Syncsort DMX-h. Knowing that real-time means different things to different people, the engineer asked what exactly the customer meant by real-time. He said, “We want to be able to move our data out of the database and into Hadoop in real-time every two hours.”
When she told me that story, I wanted to quote Inigo Montoya from “The Princess Bride.” You keep using that word, “real-time.” I do not think it means what you think it means.
But what does real-time actually mean? And what do you really mean when you say real-time? What do other people usually mean when they same real-time? How can you tell which meaning people are using? And what the heck is near real-time?
Here are four different things that I believe real-time really means, and how to determine which meaning you’re using.
Generally, when engineers say “real-time”, they are usually referring to sub-second response time. In this kind of real-time data processing, nanoseconds count. Extreme levels of performance are key to success.
“Our cyber-security process has to respond in real-time to stop automated attacks from stealing customer data.”
“This stock exchange application has to bid in real-time or we’ll lose money.”
If this is what you mean when you say “real-time data processing”, then you need the data to come in, the condition for response to be evaluated, and the response to happen ̶ all generally in less than a second. And if someone else’s system can do it a few nanoseconds faster, you might lose out. In this kind of real-time, pushing the limits of performance isn’t a bonus; it’s a necessity.
Human Comfortable Response Time
What this kind of real-time processing comes down to is a commandment: “Thou shalt not bore or frustrate the users.” The performance requirement for this kind of processing is usually a couple of seconds.
“We need real-time drill down on visualizations for our business intelligence team, no matter how big the data.”
“This website needs to respond to user requests in real-time or we’ll lose sales.”
If this is what you mean when you say “real-time”, then performance matters, but it may not be the number one criteria. In some cases, a difference of a single second can be critical. For instance, if a person clicks on an ad on a web page, and the page takes 4 seconds to load, the user is likely to get bored and go look at a different web page. If that same page had loaded in 3 seconds, that user might have bought something on that web page.
For the most part, however, as long as the data gets crunched and the application responds before the user decides to go surf somewhere else, or check email or something, then the performance requirement is met.
If when you say “real-time”, you mean the opposite of scheduled, then you mean event-driven. Instead of happening in a particular time interval, event-driven data processing happens when a certain action or condition triggers it. The performance requirement for this is generally before another event happens.
“As changes are made to the database, the replication process copies them out to the cluster in real-time.”
“A listener watches for data to arrive from our customers in this location, then loads it into the system in real-time.”
In some cases, you don’t know precisely when you’ll need data processing done, but as soon as a certain thing happens, that’s when the need for data processing is triggered. Common event examples are changes in the data or user actions.
There are actually two different performance requirements for event-driven data processing. First, the data processing system has to be finished working and ready to start again before the next event happens. So, if on average, the events happen no closer together than five minutes, a data processing time frame of 2-3 minutes is excellent. If the events tend to happen an average of 10 seconds apart, then clearly, a 2-3 minute processing time would be unacceptable.
The second performance requirement may be more arbitrary. It’s the busines SLA. If for example, you want to be able to assure the CEO that his dashboards have the most current data up to the minute, then the data processing has to be able to complete within a minute of any data change in order to meet that deadline.
Streaming Data Processing
If when you say “real-time”, you mean the opposite of batch processing, then you mean streaming data processing. In batch processing, data is gathered together, and all records or other data units are processed in one big bundle until they’re all done. In streaming data processing, the data is processed as it flows in, one unit at a time. And once the data starts coming in, it generally doesn’t end. The performance requirement for streaming data processing is you must process data as fast as the data flows in.
“We’re sifting through Twitter data in real-time for mentions of our company to keep an eye on sentiment.”
“The server information in this data center is monitored in real-time to catch problems early.”
More and more, when people say “real-time data processing” these days, they are most likely referring to streaming data processing. Streaming data processing has some very specific, and sometimes tricky to implement requirements. You have to be able to process the data continuously, without start-up or clean-up overhead. Micro-batch streaming data processing frameworks like Spark Streaming have found a way to handle start-up and clean-up needs while still keeping up with streaming performance speed requirements. Streaming data processing also requires a way to deal with occasional system failures without massive data loss. In some cases, data loss is acceptable, but in others, it isn’t.
The takeaway from all these different meanings of real-time is not that you’re using the word wrong, or one definition is more right than another. It’s that when you’re thinking about implementing a real-time data processing application, it’s important to consider what kind of real-time you really mean. Based on that, you can determine what level of performance you will require.
So, what does near real-time mean? Well, near real-time is essentially something engineers say because they’re cringing inside about how ambiguous the word “real-time” has become. Seriously, why does one little hyphenated word suddenly have four different meanings? What’s up with this English language drift thing? It’s so imprecise!
Near real-time basically means any one of the definitions I mentioned, aside from sub-second response time. Although, I have heard an engineer or two use real-time to mean streaming.
So, What Does Real-Time Really, Really Mean?
At the recent Spark Summit East, Syncsort GM, Tendu Yogurtcu was asked, “What trends do you see coming up?” to which she responded, “A lot more customers are moving to real-time data processing.” Ali Ghodsi, CEO of DataBricks had the same opinion. “We’re seeing a real push for real-time.” He also saw trends toward breaking data siloes and doing more advanced analytics, “all in real-time.” Tendu also said that the future lies in combining streaming and batch on one platform.
In the same interview, Tendu and Dave Vallente of the CUBE delved into the question that spurred this blog post – what is the real definition of real-time? Dave came up with a great one that covers all four of the above meanings, and Tendu called back to it: “Respond before you lose the customer.” This, in some ways, is the best possible way to think about real-time when designing data processing systems.
Regardless of what level of performance your system has in any given situation, if you end up losing the customer, then it’s simply too slow. Maybe you should try moving up to real-time data processing.