The Beginner’s Guide to Apache Kafka
Apache Kafka is making serious waves in the sea of data integration. The most famous user, of course, is its creator, LinkedIn, but there are tons of other businesses leveraging this technology now, too, most notably: Yahoo!, Netflix, Twitter, and Uber. So, what is this thing called Kafka? What does it do? Why is it better than the alternatives? Perhaps more pressing, is it something you need to consider?
Kafka is one of those things that can be described rather simply, but that doesn’t touch the actual depth and breadth of its capabilities. In short, Kafka is a distributed publish-subscribe messaging system that was designed with three critical factors in mind: speed, scalability, and durability. Obviously, there are many existing publish-subscribe messaging systems, so what sets Kafka apart from the herd?
How Kafka Differs from Other Messaging Systems
Kafka handles each topic partition as a log (an ordered sequence of messages) and assigns every message within a partition a unique offset. It doesn’t try to track which messages each consumer has read so that it can keep only the unread ones. Instead, it retains all messages for a specified length of time, and it’s up to each consumer to track its own position within the log. As a result, Kafka can support a very large number of consumers and retain significant amounts of data while incurring little overhead.
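To make the offset idea concrete, here is a minimal sketch in Python. It is a toy in-memory model, not the Kafka client API: the names PartitionLog, Consumer, append, read, and poll are hypothetical and chosen only for illustration, and time-based retention is left out to keep it short. It shows that the log stores no per-reader state and that each consumer advances, or rewinds, its own offset.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionLog:
    """Toy stand-in for one topic partition: an append-only list of messages."""
    messages: list = field(default_factory=list)

    def append(self, message: str) -> int:
        """Store a message and return its offset (its position in the log)."""
        self.messages.append(message)
        return len(self.messages) - 1

    def read(self, offset: int, max_messages: int = 10) -> list:
        """Return up to max_messages starting at offset.

        The log does not record who has read what.
        """
        return self.messages[offset:offset + max_messages]


@dataclass
class Consumer:
    """Each consumer remembers its own position in the log."""
    name: str
    offset: int = 0

    def poll(self, log: PartitionLog, max_messages: int = 10) -> list:
        """Read the next batch and advance this consumer's own offset."""
        batch = log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch


log = PartitionLog()
for event in ["click", "view", "purchase"]:
    log.append(event)

fast = Consumer("fast")
slow = Consumer("slow")

fast_batch = fast.poll(log)      # reads all three messages
slow_batch = slow.poll(log, 1)   # reads only the first message

# Replaying is just a matter of resetting the consumer's own offset;
# the messages are still in the log because reading never removes them.
fast.offset = 0
replayed = fast.poll(log, 2)
```

Because the two consumers share nothing except the log itself, adding another consumer costs the log no extra bookkeeping, which is the property the paragraph above credits for Kafka’s low overhead.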
Kafka Within the Hadoop Infrastructure
Kafka works in concert with a number of other popular big data products, including Apache Spark, Apache Storm, Apache HBase, and other real-time data analytics, rendering, and data streaming tools. It supports a wide variety of messaging use cases, including stream processing, website activity tracking, metrics collection and monitoring, log aggregation, and more. In practical terms, this means Kafka can carry geospatial data from an entire fleet of long-haul semi trucks or sensor data streaming in from HVAC equipment. Kafka is built for brokering massive message streams for low-latency analytics within an enterprise that uses Hadoop.
Kafka is a natural companion to your Hadoop operations and has a lot to offer if you find other messaging systems deficient or too demanding of system resources. If you’re ready to add Kafka to your Hadoop toolbox, Syncsort’s integration with the Kafka distributed messaging system lets you use DMX-h’s easy-to-use graphical interface to subscribe to, transform, and enrich enterprise-wide data coming from real-time Kafka queues. It can also publish these enriched datasets back to Kafka, which simplifies the creation of real-time analytical applications by cleansing, pre-processing, and transforming data in motion.