Summer’s winding down, and so is our Syncsort Summer School series. In today’s post, we’re recapping all things Apache Kafka.
Despite joining an already crowded Hadoop ecosystem, Kafka has attracted plenty of attention. Enterprises are making room for the tool because it can genuinely handle data streams in real time.
The Origins of Kafka
Like so many innovations, Kafka was born of necessity. At the time, LinkedIn had three engineers managing approximately 300 billion user events every day, and some of that data was getting lost. Since LinkedIn couldn’t very well take the site offline to fix its pipeline, it was in desperate need of a solution. In 2011, LinkedIn engineers Jun Rao, Jay Kreps, and Neha Narkhede built Apache Kafka.
According to Narkhede: “…the typical way for businesses to know what’s going on was through batch processing once a day at midnight. That’s too slow, now. At LinkedIn, we were responsible for LinkedIn’s data systems and we said, ‘Wow. We should actually know how LinkedIn is doing – how users are accessing the website in real time.’ So we looked at everything in the space that was available, and there wasn’t really a good solution for it.”
The creators then open-sourced Kafka and watched as thousands of companies quickly adopted it.
Apache Kafka Basics
In short, Kafka is a distributed publish-subscribe messaging system that was designed with three critical factors in mind: speed, scalability, and durability. It collects large quantities of data in real time as it streams in via user interactions, logs, application metrics, IoT devices, stock tickers, etc., and delivers it as a real-time data stream ready for use.
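To make the publish-subscribe model concrete, here is a minimal, purely illustrative in-memory sketch. The `Broker` class and its methods are hypothetical stand-ins, not Kafka’s actual API (a real deployment runs a broker cluster and uses a client library), but they capture the core idea: producers append records to named topics, and each consumer reads the topic’s log independently at its own offset.

```python
from collections import defaultdict

class Broker:
    """Toy in-memory stand-in for a Kafka broker (illustrative only).

    Real Kafka distributes partitioned, replicated topic logs across a
    cluster of brokers; this sketch keeps a single ordered log per topic.
    """

    def __init__(self):
        self.topics = defaultdict(list)  # topic name -> ordered log of records

    def publish(self, topic, record):
        """Append a record to the topic's log (producer side)."""
        self.topics[topic].append(record)

    def consume(self, topic, offset=0):
        """Read records starting from a given offset (consumer side).

        Unlike a traditional queue, records are not removed on read, so
        multiple consumers can each process the full stream at their own pace.
        """
        return self.topics[topic][offset:]

broker = Broker()
broker.publish("user-events", {"user": "alice", "action": "page_view"})
broker.publish("user-events", {"user": "bob", "action": "click"})

# Two independent consumers each see the whole stream.
events_a = broker.consume("user-events")
events_b = broker.consume("user-events")
```

This durable-log behavior, where reading does not consume the message, is what lets many downstream systems subscribe to the same stream without interfering with one another.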
As noted in our Apache Kafka Beginner’s Guide, the thing that sets Kafka apart from the herd is that it is built to play well with your existing big data tools and solutions, including other Apache projects like Spark and Storm.
In our recent conversation with Databricks’ Reynold Xin, he discussed how well Spark integrates with Kafka: “We took care of all the details so the user doesn’t have to worry about it. All they need to do is spark.readstream and then the Kafka stream information, and put in the topic you want to subscribe to, and now you’ve got a DataFrame.”
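In code, the integration Xin describes looks roughly like the following Structured Streaming sketch. The broker address and topic name are placeholder values, and the snippet assumes a running Spark installation with the Kafka connector and a reachable Kafka cluster, so treat it as illustrative rather than copy-paste runnable:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-demo").getOrCreate()

# Subscribe to a Kafka topic and get back a streaming DataFrame.
# "localhost:9092" and "user-events" are placeholder values.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "user-events")
      .load())

# Kafka records arrive as binary key/value columns; cast them to
# strings before applying further DataFrame transformations.
events = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
```

From here, `events` can be transformed with ordinary DataFrame operations and written out with `writeStream`, which is what makes the Kafka-to-Spark hand-off feel like working with any other DataFrame.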
Analyst Robin Bloor weighed in: “There is a necessity to have a very distributable message management environment that replaces the old Enterprise Service Buses, and Kafka is it.” He also believes Kafka is unlikely to be dethroned.
Kafka is ideally suited to certain use cases, such as tracking website activity and alerting/reporting on operational metrics. It can also collect logs from multiple services and make those logs available in a standard format to multiple consumers, like Hadoop.
Expanding to the Enterprise
In late 2016, Kafka made the jump to the enterprise. Recognizing the need for Kafka in larger operations, the team at Confluent set to work on creating an enterprise edition.
Neha Narkhede shared some details regarding the enterprise upgrade: “We just took the feature set, and made it a lot more operational, more rock solid, especially for companies who are serious about using Kafka in production. They need things like multi-data center capability, much more significant monitoring capability, and a lot of operational smarts. That goes in the enterprise version.”
Streaming Data Integration
If you’re interested in adopting Kafka for your own use cases, Syncsort’s integration with the Kafka distributed messaging system lets users leverage the easy-to-use graphical interface of its data integration solution, DMX-h, to subscribe to, transform, and enrich enterprise-wide data arriving in real-time Kafka queues.
In her post on integrating streaming data with Kafka, Fernanda Tavares, Syncsort’s VP of Data Integration R&D, explains: “By adding point-and-click support to Kafka sources and targets in DMX-h, we’ve made this technology more accessible to Enterprise customers and allowed them to combine batch and streaming data processing in a single platform. We did this to make Kafka more accessible to our customers because they don’t have to write code to leverage it or when they upgrade to new versions.”