Expert Interview (Part 2): Confluent’s Neha Narkhede on Schema Registry Strategy and Purpose
In part one of the conversation between Neha Narkhede, CTO and Co-Founder of Confluent, and Paige Roberts of Syncsort, we discussed the origins of Kafka and Confluent, and what the Confluent platform brings to the table. In this part, we’ll dive into the Schema Registry, its purpose and plan, and also a big announcement that Confluent made earlier this year.
Paige: One of the things I’ve been learning about is the Schema Registry.
What’s the strategy behind the Schema Registry? What’s its purpose?
The purpose is pretty simple, although people often realize its importance two steps down the line. The purpose is metadata management. It allows you to keep a history of how your data is defined, who owns it, and what its purpose is. More importantly, it allows you to evolve it on the fly. So, when developers make changes to the application’s data, intentional or not, you don’t want your entire downstream data pipeline to break. That’s what Schema Registry provides: the ability to enforce that automatically.
The concept of metadata management and schemas has long existed in enterprises, in data warehouses, and in Hadoop as well. The way it is relevant to Kafka is that Schema Registry allows every message sent to a Kafka topic to adhere to a schema, so that several versions of a schema can safely operate across various applications. Schema Registry is open source and is available through Confluent Open Source. Though it defaults to using Avro as the message format, it is flexible enough to allow any other data format.
So, we’re encouraging the use of Schema Registry earlier rather than later.
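Editor’s sketch: the compatibility enforcement described above can be illustrated with a small, simplified check. This is not Schema Registry’s actual implementation. It assumes Avro-style record schemas written as Python dictionaries, and it applies only one rule: a new schema can read old data if every field it expects either already exists with the same type or carries a default. Real Avro resolution also allows type promotions, which this sketch ignores. The `PageView` schema and the function name are hypothetical.

```python
def is_backward_compatible(old_schema, new_schema):
    """Return True if a reader using new_schema can decode data written with old_schema.

    Simplified rule (an assumption, not the full Avro specification):
    each field in the new schema must either exist in the old schema
    with an identical type, or declare a default value.
    """
    old_fields = {f["name"]: f for f in old_schema["fields"]}
    for field in new_schema["fields"]:
        old = old_fields.get(field["name"])
        if old is None:
            # Field is new: old records lack it, so a default is required.
            if "default" not in field:
                return False
        elif old["type"] != field["type"]:
            # Field exists but its type changed (promotions are not modeled here).
            return False
    return True


# Hypothetical version 1 of a record schema.
v1 = {
    "type": "record",
    "name": "PageView",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "url", "type": "string"},
    ],
}

# Adds an optional field with a default: old data can still be read.
v2_ok = {
    "type": "record",
    "name": "PageView",
    "fields": v1["fields"]
    + [{"name": "referrer", "type": ["null", "string"], "default": None}],
}

# Adds a required field with no default: old data cannot be read.
v2_bad = {
    "type": "record",
    "name": "PageView",
    "fields": v1["fields"] + [{"name": "referrer", "type": "string"}],
}

print(is_backward_compatible(v1, v2_ok))
print(is_backward_compatible(v1, v2_bad))
```

A check of this kind, run when a producer tries to register a new schema version, is what lets an incompatible change be rejected before it reaches downstream consumers.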
One of the things you need to do when you’re trying to track metadata across the enterprise is look at where the data came from, where it’s going, and where it ends up.
Data lineage. Yes.
Kafka may not be part of all of those steps. Do you have a way to integrate with other sections of the story?
Kafka integrates with other data systems and sources through its native Connect APIs. The connectors to systems of all types, like databases, Hadoop, Elastic, Cassandra and so on, are available through Confluent Open Source. Those connectors are plug and play out of the box, and they connect Kafka with all your existing systems. After all, people want to create streaming data pipelines using these off-the-shelf and well-tested connectors rather than reinvent the wheel, right?
So you can get data flowing. But once you get data flowing, you want to know exactly what you pointed out: Where did it come from? Where is it going? How complete is it end-to-end? That’s the purpose of the monitoring solution, Confluent Control Center, that you have as part of Confluent Enterprise. It gives you a full monitoring view of where and how data originated, where it is going, and whether it respects your SLAs. So, whether you’re a developer or operator, you have solutions that will have you covered for end-to-end data monitoring.
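Editor’s sketch: to make the plug-and-play point about connectors concrete, the snippet below builds the kind of JSON definition that Kafka Connect accepts for a connector. It uses the file source connector that ships with Apache Kafka as a stand-in. The connector name, file path, and topic are hypothetical values chosen for illustration, and nothing here is specific to the connectors Neha mentions.

```python
import json


def file_source_connector(name, path, topic):
    """Build a Kafka Connect connector definition as a plain dictionary.

    The name, path, and topic arguments are illustrative placeholders.
    """
    return {
        "name": name,
        "config": {
            # File source connector bundled with Apache Kafka.
            "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
            "tasks.max": "1",
            "file": path,    # file whose lines become messages
            "topic": topic,  # Kafka topic the lines are written to
        },
    }


definition = file_source_connector("demo-file-source", "/tmp/input.txt", "demo-topic")
payload = json.dumps(definition, indent=2)
print(payload)
```

In a running deployment, a payload like this would typically be submitted to the Connect REST API’s `/connectors` endpoint. The sketch only builds and prints it, so it runs without a Kafka cluster.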
So, you mentioned before we started that you made an announcement in September.
So, tell me, what’s the big splash?
The big splash is twofold. First, 35% of Fortune 500 companies use Kafka now: 7 out of 10 top US global banks, 8 out of 10 insurance companies, and 9 out of 10 telecom companies. I think Kafka adoption is not only soaring, it’s also pervasive across different industries.
The second announcement is Confluent Enterprise with Kafka for the Enterprise. It offers multi-datacenter management and automatic data balancing to operate Kafka with ease, as well as alerting capabilities in the Control Center. With this, you not only have the open-source version that is developer friendly, but you also now have this enterprise-ready version, which lets you take Kafka into serious mission-critical applications.
The next level.
Yeah. We have years of operating Kafka at scale and we want to offer those capabilities as part of an advanced version of Kafka, along with premium support by the Kafka engineers.
Where do you see it going?
I think the trend is that storing and processing data at rest is giving way to data in motion, and it’s happening very quickly.
A few years ago, the whole trend around big data was, “the more the better.” The more data you can collect and the more data you can process in a batch fashion, the better. In that world, the value of data is roughly proportional to the volume collected.
But I think what Kafka is enabling is the transition to stream data, which is more like “the faster the better.” The quicker you make use of your data, the higher the value you can get. The key takeaway is that the value is disproportionately higher for fresher data than it is for all the data.
That makes sense.
And we see enterprises rapidly adopting that world view: we need quicker customer 360; we need quicker sales analytics; we need quicker operations and equipment monitoring. All of that speaks for broader Kafka usage. And much more exciting, I think, is that it’s really changing the way companies leverage their data. That’s something we made possible at LinkedIn. Kafka at LinkedIn handles more than a trillion messages per day. It’s a humongous installation.
Yeah, it is gigantic. Microsoft and Netflix also use Kafka to process close to a trillion messages per day. So, I would personally be very excited if companies adopt this world view and actually succeed in moving to a much more streaming-first architecture. I think in the end game, a Kafka-based streaming platform will be as important as databases to companies. That’s a very exciting mission to work toward.
It is exciting.
In part 3 of this conversation with Neha Narkhede, we’ll talk about what it’s been like for Neha to be a prominent engineer and executive in the Big Data field, fighting imposter syndrome, and being an autodidact.