Keith Kohl

With the General Availability of Hadoop 2, pluggable sort is now a reality for all Hadoop 2-based distributions.  With the GA of the Hortonworks Data Platform 2.0 (HDP 2.0), Syncsort is announcing that we are extending our partnership with Hortonworks, including support and certification of HDP 2.0 with YARN.

So what is YARN, and why is it important?  To learn more, read my guest post on the Hortonworks blog.


Today, Syncsort is announcing our Hadoop ETL offering for the cloud on Amazon Web Services (AWS).  As organizations and individuals seek to learn more about Hadoop, try out new use cases, and scale clusters up and down easily, quickly & affordably, they are increasingly looking at cloud-based infrastructures.

Syncsort Ironcluster: Hadoop ETL for Amazon Elastic MapReduce – Release 1 is the first and only ETL tool for Elastic MapReduce (EMR), Amazon’s cloud-based Hadoop environment, available in the Amazon Marketplace!  In fact, there are a lot of firsts here:

  • First Data Integration-as-a-Service Engine for Amazon Elastic MapReduce (Amazon EMR)
  • Syncsort’s first cloud-based offering for Hadoop
  • As mentioned, the first and only ETL (Extract, Transform, Load) tool available for EMR
  • The first, and only, ETL product that is deeply integrated with MapReduce
  • A free-use version is available (more below)

There are many documented use cases for Hadoop, but a very common one is ETL.  Even when users don’t know they’re doing ETL, that’s what they’re doing.  With Hadoop, I’ve heard it called data refinement, data preparation, data management, and so on.  But at the end of the day, users are aggregating web logs to understand patterns, joining disparate data sources, sorting data, filtering and reformatting it, and more.  That’s ETL!
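To make that concrete, here is a minimal sketch (in plain Python, with a made-up log format) of the kind of filtering and aggregation a typical web-log ETL job performs:

```python
from collections import Counter

# Hypothetical web-log lines: timestamp, URL, HTTP status
log_lines = [
    "2013-11-12T10:00:01 /home 200",
    "2013-11-12T10:00:02 /products 200",
    "2013-11-12T10:00:03 /home 500",
    "2013-11-12T10:00:04 /home 200",
]

def parse(line):
    """Extract: split a raw line into typed fields."""
    ts, url, status = line.split()
    return ts, url, int(status)

# Transform: filter out server errors, then aggregate hits per page
hits = Counter(
    url for _, url, status in map(parse, log_lines) if status < 500
)
print(hits.most_common())  # the "Load" step would write this out
```

At cluster scale these same steps are spread across mappers and reducers, but the logic (parse, filter, aggregate) is identical.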

When we started this project, our goal was to make it easy and attractive for users to get started using Ironcluster on EMR.  For instance,

  • It’s available in the Amazon Marketplace
  • There’s a free usage version available.  You still need to pay for your EC2 & EMR usage, but Ironcluster is available free of charge for up to 10 nodes.
  • The pricing is very attractive; there are 4 usage levels available depending on the number of nodes you have and the level of support you need

Usage Level | Maximum Nodes | Ironcluster Price/Hour | Support
Free        | 10            | $0 – Free!             | Online through our community
(paid tier) |               |                        | Community, email, phone
(paid tier) |               |                        | Community, email, phone
(paid tier) |               |                        | Community, email, phone

  • We provide examples and templates, what we call Use Case Accelerators, with documentation and even videos so users can get started quickly.  We also have an Ironcluster resources page to help navigate everything that’s available
  • Nothing to download.  Everything is hosted in the cloud, including the graphical interface to develop & maintain the ETL jobs

So why is this “Release 1”?  This is obviously not the first release of our Hadoop product; we released that back in May & June.  But it is our first offering on Amazon EMR, and we’ve got many new enhancements and features planned to make the user experience even better.

If you happen to be at AWS re:Invent this week in Las Vegas, stop by booth #825 in re:Invent Central for a demo and to learn more from our technical experts.

Get started today and let us know what you think!


Last Monday, we announced two new DMX-h Hadoop products, DMX-h Sort Edition and DMX-h ETL Edition.  Several blog posts last week covered why I thought the announcement was cool, as well as Hadoop benchmarks on both TeraSort and ETL workloads.

Part of our announcement was the DMX-h ETL Pre-Release Test Drive.  The test drive is a trial download of our DMX-h ETL software.  We have installed our software on our partner Cloudera’s VM (VMware) image, complete with the use case accelerators, sample data, documentation, and even videos.  While the download is a little large ─ ok, it’s over 3GB ─ it’s a complete VM with Linux and Cloudera’s CDH 4.2 Hadoop release (the DMX-h footprint is a mere 165MB!).



The use case accelerators allow users to get going quickly, not only with DMX-h ETL but also with the test drive itself.  We’ve included use cases that we hear about consistently: identifying changes between two data sets (change data capture), aggregating web log data, translating and loading mainframe data into HDFS, and more.

So that you can actually use the use case accelerators, we have included sample data, comprehensive documentation and even videos.

The test drive is not your normal download.  This is actually a pre-release of our DMX-h ETL product offering.  While we have announced the product, it is not generally available (GA) yet; GA is scheduled for the end of June.  We are offering a download of a product that isn’t even available yet…how many vendors do that?!

The test drive is not only about the user experience.  While that is the primary focus, we also want to hear back, so our users can influence the product.  We are using our community both to ask questions and to let users share feedback on their test drive.

The DMX-h ETL Pre-Release Test Drive has been live for just over a week, and we have had literally hundreds of downloads.  Join the crowd and take the test drive.


Today, we here at Syncsort announced two new Hadoop products – DMX-h Sort and DMX-h ETL.  I don’t want to repeat the announcement here; I’d rather talk about why I think this is unique.  When we started planning these new releases, we set our goal: “make the hard, easy; and the impossible, possible.”

We consistently hear from our customers and partners that ETL (some call it data refinement, data collection & preparation, etc.) is the #1 use case for Hadoop.  But if you think about typical ETL, some problems become very hard given the distributed nature of data on Hadoop.  A join, for instance, particularly when both sides of the join are large, is one of them.

This was a challenge that we embraced.  We not only wanted to help our customers and partners solve Hadoop issues, we wanted to make them easy to solve and significantly increase performance at the same time.  Part of our announcement today is “smarter productivity”, with use case accelerators for common use cases such as join.  These are pre-built job templates, complete with documentation.  You simply fill in the metadata and execute, or you can try them out with the samples we include.

Another common hard (impossible?) problem is Change Data Capture (CDC): the process of taking two data sets (the current data set and the previous data set) and identifying the changed data.  Each changed record needs to be flagged as New, Updated, or Deleted.

This can be very difficult in MapReduce when both data sets are large.  You could literally write hundreds of lines of Java code, or use DMX-h ETL, where the approach is very straightforward using our GUI.

So, why is this hard in MapReduce?  It’s not a difficult concept, but there are difficulties in implementation.  Think about what needs to be done on the Map side before you can identify the changed records on the Reduce side.  The Map side needs to bring the data together in each Mapper while keeping track of whether each record comes from the current or the previous data set.  The data then needs to be sorted before it is sent to the Reducers.  And don’t forget: you need to 1) route all records with the same key (such as customer ID) to the same reducer, and 2) avoid hard-coding the number of reducers, because you’ll want that to be dynamic based on a number of factors, particularly data volume at execution time.
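As an illustration of the Map side (a hand-rolled Python sketch of the general pattern, not Syncsort’s implementation), the mapper only needs to tag each record with its source data set and emit the join key, so that the framework’s shuffle can group like keys together:

```python
from collections import defaultdict

def cdc_map(record, source):
    """Emit (key, (source, record)) so every record with the same
    customer ID lands in the same reducer, tagged by data set.
    `source` is "prev" or "curr"; records are dicts with a 'cust_id' field."""
    return record["cust_id"], (source, record)

def shuffle(mapped):
    """Simulate MapReduce's shuffle/sort: group mapper output by key."""
    groups = defaultdict(list)
    for key, tagged in mapped:
        groups[key].append(tagged)
    return groups

prev = [{"cust_id": 1, "name": "Ann"}, {"cust_id": 2, "name": "Bob"}]
curr = [{"cust_id": 1, "name": "Anne"}, {"cust_id": 3, "name": "Cal"}]

mapped = [cdc_map(r, "prev") for r in prev] + [cdc_map(r, "curr") for r in curr]
groups = shuffle(mapped)  # one group per customer ID, sources preserved
```

In real MapReduce the grouping is done by a partitioner over however many reducers the framework chooses at run time, which is exactly why you don’t want that number hard-coded.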

On the Reduce side, now that all records with the same key (customer ID) are in the same Reducer, you need to re-split the data into the current and previous data sets and then perform a full outer join.  Records that appear in the current version but not the previous version are inserts, and those that appear in the previous version but not the current one are deletes.  For records that appear in both, the non-primary-key fields are compared, and any cases where they are not identical are updates.  Each changed record needs to be flagged with an I, D, or U.
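The reduce-side comparison can be sketched the same way (again a plain-Python illustration of the logic, not the DMX-h internals): a full outer join per key, flagging each difference:

```python
def cdc_reduce(key, tagged_records):
    """Given all records for one key, tagged "prev" or "curr",
    return (key, flag, record) with flag I, D, or U,
    or None if the record is unchanged."""
    prev = next((r for src, r in tagged_records if src == "prev"), None)
    curr = next((r for src, r in tagged_records if src == "curr"), None)
    if prev is None:
        return (key, "I", curr)   # insert: only in the current data set
    if curr is None:
        return (key, "D", prev)   # delete: only in the previous data set
    if prev != curr:
        return (key, "U", curr)   # update: non-key fields differ
    return None                   # unchanged: emit nothing

# One call per key group, as delivered by the shuffle
changes = [
    cdc_reduce(1, [("prev", {"name": "Ann"}), ("curr", {"name": "Anne"})]),
    cdc_reduce(2, [("prev", {"name": "Bob"})]),
    cdc_reduce(3, [("curr", {"name": "Cal"})]),
]
```

Writing this toy version is easy; making it perform across a large cluster, with the sorting, partitioning, and dynamic reducer counts described above, is where the hundreds of lines of Java come in.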

You can see a video of this use case here. And you can even try this out…I’ll talk about our DMX-h Pre-Release Test Drive in a blog post coming next week.

We actually have about a dozen use case accelerators right now, and we will continue to add more as we gather feedback from our users.

Information about how we natively integrate with Hadoop MapReduce is available in the announcement, along with some initial performance & scalability results.  We will publish blog posts over the coming weeks about the performance benefits of DMX-h.

These are not only hard problems to solve in Hadoop, but we’ve been told, impossible to solve, with other approaches (i.e., other tools) on Hadoop.  Making the hard, easy; and the impossible, possible…now that’s cool!