Integrating Streaming Big Data with Kafka
Kafka is a distributed messaging system that has generated a lot of interest among our customers. Some of these customers were already using a messaging system but wanted to switch to Kafka to massively increase the number of messages they generate and process. Others realized that Kafka could be a game changer for their business by allowing them to stream Kafka message data for real-time analysis. For example, cable and telephone companies can get alerts on events like dropped calls as they happen, so they can respond immediately; hospitals can get alerts to life-threatening conditions from machines monitoring their patients; and online banks can get near real-time updates through customer events processed and integrated via a Kafka data bus.
So, how could these Enterprise customers get started using Kafka?
It made sense for us to add streaming data support to DMX-h by reading from and writing to Kafka topics. This allowed our users to select Kafka topics as sources and targets from our GUI, and leverage the full ETL power of DMX-h when analyzing the contents of Kafka messages. DMX-h users could also leverage a MapReduce or Spark cluster to parallelize the consumption of Kafka messages by letting each node of the cluster read from a subset of partitions in the Kafka topic.
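To illustrate the idea of each cluster node reading from a subset of partitions, here is a minimal sketch of a round-robin partition assignment. This is not DMX-h's actual implementation, and the function name is hypothetical; in practice a Kafka consumer group handles this assignment automatically.

```python
# Hypothetical sketch: distribute a topic's partitions across cluster nodes
# round-robin, so each node consumes a disjoint subset in parallel.
def assign_partitions(num_partitions, num_nodes):
    """Map each node index to the list of partition indices it will consume."""
    assignment = {node: [] for node in range(num_nodes)}
    for partition in range(num_partitions):
        assignment[partition % num_nodes].append(partition)
    return assignment

# A 12-partition topic read by a 4-node cluster:
print(assign_partitions(12, 4))
# → {0: [0, 4, 8], 1: [1, 5, 9], 2: [2, 6, 10], 3: [3, 7, 11]}
```

Because partitions are disjoint, the nodes never process the same message twice, and the topic's partition count sets the upper bound on consumption parallelism.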
The Kafka project has matured considerably in the past year. When we started our development, the Kafka APIs were still changing, and the changes were not backwards compatible. Our engineering team started working with Kafka 0.8.2.2 and kept up with the changes as new versions were released. Based on our discussions with our Confluent and Cloudera partners, we decided to release support once Kafka 0.9.0.0 was available, as the API had stabilized and security had been added.
During development, we worked closely with some of our customers as design partners. Based on their feedback, we decided to support consuming messages in batch sizes defined either as a number of messages or as a time interval, to provide more flexibility. We also made sure DMX-h is a reliable Kafka consumer: messages read are marked as committed in Kafka only after DMX-h has written them to persistent storage, such as a file or database table. This avoids loss of data if any part of the ETL process fails or needs to be restarted.
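The batching and commit ordering described above can be sketched as a simple consumer loop. This is a simulation of the pattern, not DMX-h code: `persist` and `commit` are hypothetical callbacks standing in for writing to storage and committing offsets to Kafka, and the message source is any iterable.

```python
import time

def consume_batches(messages, persist, commit, max_count=None, max_seconds=None):
    """Group messages into batches bounded by count and/or elapsed time.

    Each batch is persisted first and committed only afterwards, so a crash
    between the two steps re-delivers the batch (at-least-once delivery)
    rather than losing it.
    """
    batch, started = [], time.monotonic()
    for msg in messages:
        batch.append(msg)
        full = max_count is not None and len(batch) >= max_count
        expired = max_seconds is not None and time.monotonic() - started >= max_seconds
        if full or expired:
            persist(batch)     # write to a file or database table first...
            commit(batch[-1])  # ...then mark the offset as committed in Kafka
            batch, started = [], time.monotonic()
    if batch:                  # flush any trailing partial batch
        persist(batch)
        commit(batch[-1])
```

Committing after persisting trades the possibility of duplicate processing on restart for a guarantee that no acknowledged message is ever lost, which is the right trade-off for an ETL pipeline.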
If any messages are rejected by Kafka, the DMX-h producer records the rejected messages in a file, so the process can be corrected and the rejected messages can be re-processed, with no loss of data.
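This rejected-message handling follows the familiar dead-letter pattern. A minimal sketch, with a hypothetical function name and file format (DMX-h's actual reject-file layout is not shown here):

```python
import json

def produce_with_reject_file(messages, send, reject_path):
    """Try to send each message; append any rejected message, with the
    reason, to a reject file so it can be fixed and re-processed later."""
    rejected = 0
    with open(reject_path, "a") as reject_file:
        for msg in messages:
            try:
                send(msg)  # stand-in for the Kafka producer's send call
            except Exception as exc:
                reject_file.write(json.dumps({"message": msg, "error": str(exc)}) + "\n")
                rejected += 1
    return rejected
```

Recording the failure reason alongside each message makes it possible to correct the data (or the process) and replay only the rejected messages, with no loss of data.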
By adding point-and-click support for Kafka sources and targets in DMX-h, we’ve made this technology more accessible to Enterprise customers and allowed them to combine batch and streaming data processing in a single platform. Customers don’t have to write code to leverage Kafka, or rewrite it when they upgrade to new versions.