Kafka is easy to describe at a high level, but its deeper advantages and potential use cases are harder to explain. Fortunately, Kafka has excellent documentation that covers its design and implementation in detail. In brief, Kafka is a distributed publish-subscribe messaging system that was created as a fast, scalable, and durable alternative to existing solutions. It is designed to broker enormous message streams for extremely low-latency analysis within Enterprise Apache Hadoop.
Like most similar systems, Kafka maintains feeds of messages in topics. Producers write data to topics and consumers read from those topics. Because Kafka is distributed, topics are divided into partitions and replicated across multiple nodes. Messages are simple byte arrays, so developers can use them to store any object in any format they wish, including Avro, JSON, and String. Developers can also attach a key to a message, which guarantees that all messages with that key go to the same partition.
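To make the keyed-partitioning guarantee concrete, here is a minimal sketch that runs without a Kafka cluster. It assumes a hypothetical topic with four partitions and uses `zlib.crc32` as a stand-in hash; Kafka's own partitioner uses a different hash function, so the partition numbers here are illustrative only. The name `choose_partition` is made up for this example.

```python
import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition index.

    Assumption: crc32 is only a stand-in for Kafka's real hash function.
    """
    return zlib.crc32(key) % num_partitions

# Hypothetical topic with 4 partitions, each modelled as a plain list.
NUM_PARTITIONS = 4
partitions = {p: [] for p in range(NUM_PARTITIONS)}

messages = [
    (b"user-1", b"login"),
    (b"user-2", b"login"),
    (b"user-1", b"logout"),
]

for key, value in messages:
    # Same key -> same hash -> same partition, every time.
    partitions[choose_partition(key, NUM_PARTITIONS)].append((key, value))
```

Because the partition is a pure function of the key, both `user-1` messages end up in the same partition, in the order they were produced.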
When consuming from a topic, you can also configure a group with multiple consumers. Each consumer in a group reads messages from its own subset of the partitions in the topics it subscribes to. This ensures that every message is delivered to exactly one consumer in the group, and that all messages carrying the same key reach the same consumer.
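The way a group splits partitions among its members can be sketched as follows. This is a simplified round-robin assignment written for illustration, not Kafka's actual assignment code; the function name `assign_partitions` and the consumer names are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over consumers so each partition has exactly one owner.

    Assumption: plain round-robin over sorted names, chosen for clarity.
    """
    if not consumers:
        raise ValueError("a consumer group needs at least one consumer")
    ordered = sorted(consumers)
    assignment = {c: [] for c in ordered}
    for i, partition in enumerate(sorted(partitions)):
        owner = ordered[i % len(ordered)]
        assignment[owner].append(partition)
    return assignment

# Hypothetical topic with 6 partitions read by a group of 3 consumers.
assignment = assign_partitions(range(6), ["consumer-a", "consumer-b", "consumer-c"])
```

Since a partition has a single owner within the group, and a key always maps to one partition, every message with that key is seen by the same consumer.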
What makes Kafka unique is that it treats each topic partition as a log (that is, an ordered set of messages) and assigns every message within a partition a unique offset. Kafka does not try to track which messages each consumer has read so that it can retain only the unread ones. Instead, it retains all messages for a configured amount of time, and consumers are responsible for tracking their own position in each log. As a result, Kafka can support a very large number of consumers and retain very large amounts of data with little overhead.
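The log-plus-offset model described above can be sketched in a few lines. This is an in-memory toy, assuming a hypothetical `PartitionLog` class and time-based retention only; it is not how Kafka stores data on disk. Timestamps are passed in explicitly so the behaviour is deterministic.

```python
import time

class PartitionLog:
    """Toy model of one partition: an ordered log with time-based retention."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.base_offset = 0   # offset of the oldest message still retained
        self.entries = []      # list of (timestamp, payload)

    def append(self, payload, now=None):
        """Append a message and return the offset assigned to it."""
        ts = time.time() if now is None else now
        self.entries.append((ts, payload))
        return self.base_offset + len(self.entries) - 1

    def read(self, offset, max_messages=10):
        """Return (offset, payload) pairs starting at the requested offset."""
        start = max(offset, self.base_offset) - self.base_offset
        batch = self.entries[start:start + max_messages]
        return [(self.base_offset + start + i, p) for i, (_, p) in enumerate(batch)]

    def expire(self, now=None):
        """Drop messages older than the retention window, read or not."""
        now = time.time() if now is None else now
        while self.entries and now - self.entries[0][0] > self.retention_seconds:
            self.entries.pop(0)
            self.base_offset += 1

log = PartitionLog(retention_seconds=60)
for i, payload in enumerate([b"a", b"b", b"c"]):
    log.append(payload, now=float(i))

# Each consumer keeps its own position; the log stores no per-consumer state.
positions = {"fast": 0, "slow": 0}
batch = log.read(positions["fast"])
positions["fast"] = batch[-1][0] + 1
```

Adding another consumer only adds one more entry to `positions`; the log itself is unchanged, which is why the broker's overhead does not grow with the number of readers.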
Kafka was designed to deliver four distinct advantages over alternatives such as AMQP and JMS:
1. Kafka is Highly Scalable
Kafka is a distributed system that can be scaled quickly and easily without incurring any downtime.
2. Kafka is Highly Durable
Kafka persists messages on disk and replicates them within the cluster. Together, these make for a highly durable messaging system.
3. Kafka is Highly Reliable
Kafka replicates data and supports multiple subscribers. It also automatically rebalances consumers when one of them fails. This makes it more reliable than similar messaging services.
4. Kafka Offers High Performance
Kafka delivers high throughput for both publishing and subscribing. It uses disk structures that offer constant performance, even with many terabytes of stored messages.
Kafka is a natural companion to your enterprise Hadoop infrastructure if you need a real-time solution that provides ultra-fast and reliable messaging services. For more information, check out the video: Real Time Streaming with Kafka and Syncsort DMX-h