What is Apache Kafka, and Do I Need It?
Like so many innovations, Kafka was born out of necessity. At the time, LinkedIn had three engineers managing approximately 300 billion user events every day, and some of that data was getting lost. Since LinkedIn couldn’t very well go offline to fix its pipeline, the company was in desperate need of a solution.
Luckily, LinkedIn had hired a senior engineer named Jay Kreps. Kreps had an idea for handling data other than sticking it in a data warehouse: an idea for handling data in motion. Though he immediately knew his concept was viable, there were countless obstacles to overcome to make it work correctly.
Apache Kafka is scalable, durable, and distributed by design.
LinkedIn believed in Kreps and hung in there. The small team assigned to bring Kafka to life worked for a year and a half before it had a working model. Kreps and co-developers Neha Narkhede and Jun Rao originally designed Kafka to load data into Hadoop. Apache describes the design as “publish-subscribe messaging rethought as a distributed commit log.” But that’s a bit too simplistic. Kafka is better described as a distributed publish-subscribe messaging system that was built to be fast, scalable, and durable.
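To make the "commit log" idea more concrete, here is a minimal in-memory sketch. It is not Kafka's API or implementation: the `CommitLog` class, its method names, and the subscriber names are all hypothetical, chosen only to illustrate how an append-only log lets several sources publish and several subscribers read independently.

```python
from collections import defaultdict


class CommitLog:
    """Toy append-only log (hypothetical; not the Kafka client API)."""

    def __init__(self):
        self._records = []                # records are only ever appended
        self._offsets = defaultdict(int)  # next offset to read, per subscriber

    def publish(self, record):
        """Append a record and return the offset it was stored at."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, subscriber, max_records=10):
        """Return the next unread records for a subscriber and advance its offset."""
        start = self._offsets[subscriber]
        batch = self._records[start:start + max_records]
        self._offsets[subscriber] = start + len(batch)
        return batch

    def seek(self, subscriber, offset):
        """Move a subscriber's position, for example back to 0 to replay."""
        self._offsets[subscriber] = offset


log = CommitLog()

# Two hypothetical sources publish into the same log.
log.publish({"source": "web", "event": "page_view"})
log.publish({"source": "mobile", "event": "login"})
log.publish({"source": "web", "event": "click"})

# Each subscriber tracks its own position, so both can see every record.
hadoop_batch = log.poll("hadoop-loader")
alert_batch = log.poll("alerting", max_records=2)

# Reading never deletes records, so a subscriber can rewind and replay.
log.seek("alerting", 0)
replayed = log.poll("alerting")
```

Because reading only moves a per-subscriber position and never removes data, a slow or restarted subscriber does not affect the others. The real system adds persistence to disk, partitioning, and replication across machines, none of which this sketch attempts.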
LinkedIn released the infant Kafka as open source in 2011. The Apache Software Foundation adopted the child, developed it in the open, and graduated it from the Apache Incubator in 2012. Eventually, Kreps and crew left LinkedIn.
Do You Need Kafka?
When is Kafka a better option than Flume or RabbitMQ? Kafka can handle data streams from multiple sources and deliver them to multiple consumers without breaking a sweat. It is also well suited to managing messaging in massively multiplayer video games.
While Kafka does its thing incredibly well (it’s stable, reliable, and swift), some jobs are still better suited to Flume or even RabbitMQ. Kafka is ideally suited to use cases like:
- Tracking website activity
- Alerting and reporting on operational metrics
- Collecting logs from multiple services and making those logs available in standard format to multiple consumers (especially Hadoop and Solr)
While other products can handle one of these use cases, Kafka is capable of taking on all of them, and more. “Apache Kafka is widely adopted for a variety of use cases involving real-time data streams and is changing how data is used in companies,” said Jay Kreps.
Syncsort’s industry-leading data integration software, Connect for Big Data, now integrates with the Apache Kafka distributed messaging system. Users can leverage Connect for Big Data’s easy-to-use GUI to subscribe to, transform, enrich, and distribute enterprise-wide data for real-time Kafka messaging.