Big Idea Friday

What is Apache Kafka, and Do I Need It?

Like so many innovations, Kafka was born out of necessity, the mother of invention. At the time, LinkedIn had a team of roughly three engineers managing a pipeline that handled approximately 300 billion user events every day. Unfortunately, some of that data was getting lost, and since LinkedIn couldn't very well take the site offline to fix its pipeline, the company was in desperate need of a solution.

Luckily, LinkedIn had a senior engineer named Jay Kreps on staff. Kreps had an idea for handling data that went beyond sticking it in a data warehouse: an idea for handling data in motion. Though he believed from the start that the concept was viable, there were countless obstacles to overcome before it would work correctly.

Defining Kafka

Apache Kafka is scalable, durable, and distributed by design.

LinkedIn believed in Kreps and stuck with the project. The small team assigned to bring Kafka to life worked for a year and a half before producing a working model. When Kreps and co-developers Neha Narkhede and Jun Rao built Kafka, it was designed to load data into Hadoop. Apache describes the design as "publish-subscribe messaging rethought as a distributed commit log." But that's a bit too simplistic on its own: Kafka is better positioned as a distributed publish-subscribe messaging system built to be fast, scalable, and durable.
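That "distributed commit log" idea is worth unpacking: producers only ever append records to a log, and each consumer reads the log independently by tracking its own offset. The toy Python sketch below illustrates the concept only; it is not the real Kafka API, and the class and method names are invented for illustration.

```python
# Toy sketch of Kafka's core abstraction: an append-only commit log
# read independently by multiple consumers. Illustrative only -- the
# real Kafka is distributed, durable, and partitioned across brokers.

class CommitLog:
    """A single in-memory 'topic': records are only ever appended."""
    def __init__(self):
        self._records = []

    def publish(self, message):
        """Producers append; the offset of the new record is returned."""
        self._records.append(message)
        return len(self._records) - 1

    def read(self, offset):
        """Return every record at or after the given offset."""
        return self._records[offset:]


class Consumer:
    """Each consumer tracks its own position (offset) in the log,
    so slow and fast readers never interfere with one another."""
    def __init__(self, log):
        self._log = log
        self._offset = 0

    def poll(self):
        batch = self._log.read(self._offset)
        self._offset += len(batch)
        return batch


log = CommitLog()
log.publish("page_view: /home")
log.publish("click: signup_button")

fast = Consumer(log)
slow = Consumer(log)
print(fast.poll())  # each consumer independently sees every message
print(slow.poll())
```

The key design choice this models is that the broker does not delete a message once one consumer has read it; every subscriber replays the same log at its own pace, which is what lets Kafka fan the same event stream out to Hadoop, Solr, and anything else that cares.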

Eventually, Kreps and crew, along with the still-young Kafka, left LinkedIn. The Apache Software Foundation adopted the project, continued its development in open source, and graduated it from the Apache Incubator in 2012.

Do You Need Kafka?

When is Kafka a better option than Flume or RabbitMQ? Kafka can ingest data streams from multiple sources and deliver them to multiple consumers without breaking a sweat. It is, for example, well suited to handling the messaging load of a massively multiplayer video game.

While Kafka does its thing incredibly well (it's stable, reliable, and swift), some jobs are still better suited to Flume or even RabbitMQ. Kafka is ideally suited to use cases such as:

  • Tracking website activity
  • Alerting and reporting on operational metrics
  • Collecting logs from multiple services and making those logs available in standard format to multiple consumers (especially Hadoop and Solr)

While other products can do one of these things, Kafka is capable of taking on all of them, and more. "Apache Kafka is widely adopted for a variety of use cases involving real-time data streams and is changing how data is used in companies," said Jay Kreps.

Syncsort's industry-leading data integration software, DMX-h, now integrates with the Apache Kafka distributed messaging system, so users can leverage DMX-h's easy-to-use GUI to subscribe to, transform, enrich, and distribute enterprise-wide data for real-time Kafka messaging.

Authored by Christy Wilson

Syncsort contributor Christy Wilson began writing for the technology sector in 2011, and has published hundreds of articles related to cloud computing, big data analysis, and related tech topics. Her passion is seeing the fruits of big data analysis realized in practical solutions that benefit businesses, consumers, and society as a whole.