Simplify Apache Spark & Kafka Integration with Syncsort DMX-h v9
This blog was originally published by Keylink Technology on their Big Data & Analytics Blog
The latest version of Syncsort’s flagship data integration and ETL software for Hadoop, Syncsort DMX-h v9, is now shipping. It offers plenty of great new features and performance improvements, but today we’re going to focus on two of the standouts: Apache Spark and Kafka integration.
First, hands up if you find it difficult to stay current with the Darwinian evolution of the Big Data ecosystem and the sheer variety of new software projects.
So it’s OK if you’re not that familiar with Spark and Kafka. Jump to the bottom of this post, where we have a quick overview of the two technologies to get you started.
Now that we’re all on the same page, let’s talk about the power of the DMX-h Intelligent Execution (IX) engine when combined with Apache Spark.
The IX engine allows users to visually create jobs to prepare, blend and transform data and then execute on almost any platform without needing to recompile or tune the job:
- Develop on a standalone laptop or desktop PC
- Test on a cloud service such as Amazon AWS
- Deploy to production on an on-premises server running Linux, Unix or Windows
- Run on a Hadoop cluster using the MapReduce framework and process massive volumes of data
The big news is that IX v9 now supports Apache Spark, so the same job can gain all the performance benefits of Spark’s in-memory processing by simply selecting Execute on Spark from a drop-down menu – no other changes necessary!
Syncsort is committed to supporting new processing frameworks as the ecosystem continues to evolve, so users and developers can rest easy knowing that with DMX-h they’ll never have to re-do data processing design work, regardless of how much the technology landscape shifts.
Another unique feature of DMX-h worth mentioning is the ability to process mainframe data formats like EBCDIC and VSAM natively on Spark and MapReduce – no conversion necessary! This is particularly important for customers in the banking and finance industries, where regulatory compliance, data governance and data lineage tracking are necessary when blending legacy mainframe data with streaming and social media data sources for new business insights and analytics.
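For readers who haven’t worked with mainframe data, here is a minimal Python sketch of the kind of manual conversion step that EBCDIC data usually needs before other tools can read it. This is not DMX-h code. The sample bytes and the decode_ebcdic helper are made up for illustration, and we assume the common EBCDIC code page 037 (cp037), which ships with Python’s standard codecs.

```python
# Hypothetical field from a mainframe extract: the word "HELLO" encoded
# in EBCDIC code page 037. The same word in ASCII would be 48 45 4C 4C 4F.
ebcdic_bytes = bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6])


def decode_ebcdic(raw: bytes, codepage: str = "cp037") -> str:
    """Decode EBCDIC bytes to a Python string (assumes code page 037)."""
    return raw.decode(codepage)


text = decode_ebcdic(ebcdic_bytes)
print(text)

# Reading the same bytes as Latin-1 (an ASCII superset) gives unreadable text,
# which is why a conversion step is normally required.
print(ebcdic_bytes.decode("latin-1"))
```

Real mainframe records also tend to include packed decimals and fixed-width layouts described in COBOL copybooks, so hand-written conversion gets complicated quickly.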
Moving on to Apache Kafka integration in DMX-h v9, the key takeaway here is that you can now create batch and streaming jobs through the same graphical interface using existing skill sets – no need to go out and learn another new development language or API. Users can subscribe to enterprise-wide data coming from real-time Kafka queues, cleanse, transform and enrich that data in motion, and then publish the enriched datasets back to Kafka. This simplifies the creation of real-time analytical applications.
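To make the subscribe, transform, enrich and publish pattern concrete, here is a rough Python sketch of what it looks like when coded by hand. It is only an illustration: the two deques are in-memory stand-ins for Kafka topics (no Kafka client is used), and the event fields, the CUSTOMER_REGIONS lookup and the enrich function are all hypothetical.

```python
from collections import deque

# In-memory stand-ins for an input topic and an output topic (hypothetical data).
raw_events = deque([
    {"customer_id": "c1", "amount": "19.999"},
    {"customer_id": "c9", "amount": "5"},
])
enriched_events = deque()

# Hypothetical reference data used to enrich each event.
CUSTOMER_REGIONS = {"c1": "EMEA", "c2": "APAC"}


def enrich(event: dict) -> dict:
    """Cleanse the amount field and add a region looked up by customer id."""
    cleaned = {**event, "amount": round(float(event["amount"]), 2)}
    cleaned["region"] = CUSTOMER_REGIONS.get(event["customer_id"], "UNKNOWN")
    return cleaned


# "Subscribe": take events off the input; "publish": append to the output.
while raw_events:
    enriched_events.append(enrich(raw_events.popleft()))

for event in enriched_events:
    print(event)
```

A production version would also need a real Kafka client, offset management, error handling and serialization, which is the development effort a graphical tool aims to remove.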
It’s great to see Syncsort customers have already found plenty of use cases that exploit these new capabilities:
- Internet Analytics Company: Analyze and find “outliers” from billions of new digital events per month from high volume sources including Internet of Things (IoT) and mobile.
- Healthcare Organization: Enable emergency units to analyze vital signs in real time to determine whether patients are in danger.
- Financial Institution: Near real-time updates for online banking through customer events processed and integrated via a Kafka data bus.
- Insurance Company: Manage all customer application use cases using Spark, on a single platform.
Like to try out the new features for yourself? Take a 30-day Test Drive today.
Existing DMX and DMX-h customers can log in and download the new version (and full Release Notes) from the Syncsort MySupport portal.
What is Apache Spark?
Apache Spark is a fast in-memory computing framework that offers up to 100x faster performance than Hadoop MapReduce for in-memory processing, and up to 10x faster processing of data from disk. Spark applications are typically created in languages such as Java, Scala, Python and R. There’s also an exciting list of add-on modules with capabilities such as SQL on Spark, machine learning (MLlib) and graph processing (GraphX). Spark is included as standard with all the leading Hadoop distributions including Cloudera, Hortonworks and MapR, but it’s also worth noting that Spark can run in a standalone (non-Hadoop) configuration, and also on an Apache Mesos cluster.
What is Apache Kafka?
Apache Kafka is a high-throughput distributed messaging system with a modern cluster-centric design that offers strong durability and fault-tolerance guarantees. Data streams are partitioned and spread over a cluster of machines to allow for streams larger than the capability of any single machine. A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients. Applications for this technology can be as varied as Internet of Things (IoT), online banking event processing, or massively multiplayer online gaming (MMORPG) scoring and analytics. Kafka also provides connectivity with the other main players in the Hadoop real-time streaming space, including Spark Streaming, Apache Storm and Apache Flink.
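The partitioning idea is easier to see with a small example. The Python sketch below shows the general technique of assigning keyed messages to partitions by hashing the key. It is a simplification, not Kafka’s actual code: we use CRC32 as a stand-in hash (the Java client’s default partitioner hashes keys with murmur2), and the partition count and sensor events are invented for illustration.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count for one topic


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a message key to a partition. CRC32 is a stand-in hash here."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Hypothetical keyed events, e.g. readings from IoT sensors.
events = [
    ("sensor-1", 20.5),
    ("sensor-2", 19.0),
    ("sensor-1", 21.0),
    ("sensor-3", 22.4),
]

partitions = defaultdict(list)
for key, value in events:
    partitions[partition_for(key)].append((key, value))

for partition_id in sorted(partitions):
    print(partition_id, partitions[partition_id])
```

Because the same key always hashes to the same partition, all events for one sensor stay in order within that partition, while different partitions can live on different machines.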