How Do You Move Data Preparation Work from MapReduce to Spark without Re-Coding?

This blog was originally posted on the “Big Data Page by Paige” Blog.

So, is this a situation you recognize? Your team creates ETL and data preparation jobs for the Hadoop cluster, puts a ton of work into them, tunes them, tests them, and gets them into production. But Hadoop tech changes faster than Texas weather. Now, your boss is griping that the jobs are taking too long, but they don’t want to spring for any more nodes. Oh, and “Shouldn’t we be using this new Spark thing? It’s what all the cool kids are doing and it’s sooo much faster. We need to keep up with the competition, do this in real-time.”

You probably want to pound your head on your desk because, not only do you have to hire someone with the skills to build jobs on another new framework, and re-build all of your team’s previous work, but you just know that in a year or two, about the time everything is working again, some hot new Hadoop ecosystem framework will be the next cool thing, and you’ll have to do it all over again.

Doing the same work over and over again is so very not cool. There’s got to be a better way. Well, there is, and my company invented it. And now I’m allowed to talk about it.

I promised a while back to talk about some of the cool technical things that excited me about Syncsort, but I had to hold off for a while until some of them were public. Well, as of today, the cat is officially out of the bag. The announcement of the new capabilities added to version 9 of Syncsort DMX and DMX-h went out, and I already did an official Syncsort blog post with a Wizard of Oz theme, “Syncsort V 9 Big Data Integration – Streaming and Kafka and Spark, Oh My!” (That was a fun post to write.) So, duty done. Now, I can geek out a bit about my favorite bit of super-cool tech that my new company invented.

Intelligent eXecution (IX) is what Syncsort calls this super-cool thing, but it’s really two different things under the covers.

First, Intelligent eXecution does for Syncsort what Tungsten does for Spark in some ways, or what a good query optimizer does for a database. You design jobs in the Syncsort graphical user interface. You don’t specify HOW you want the jobs to run, just what you want them to do. This is a lot like how you write a SQL query for a database without specifying HOW that query will execute, or how you define a logical DAG for Spark without specifying the physical execution of that DAG.

Syncsort specializes in sorting. It’s what they do better than anyone else on any platform. Half the world’s mainframes use Syncsort software for sorting. When Syncsort moved into the ETL business, they realized that the slow choke-points of most ETL processes were in the sort-related data prep functionality. Things like joins and aggregations were where everything bogged down. They’ve done some good business just replacing ETL processes that were draggy slow, and speeding up that particular sort-related job or task. Since Syncsort has hundreds of sort algorithms, the smart way to accelerate sort wasn’t making some poor schmuck guess which sort algorithm would be best in every situation and specify it at design time. It was building an engine that could derive the ideal sort algorithm at runtime based on the task at hand, the data configuration, the available resources, etc. You could call that a sort optimizer. That’s kind of the heart of what almost all Syncsort products do, but that’s just a tiny piece of what IX does.
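If you want a feel for what “derive the sort algorithm at runtime” means, here’s a toy sketch in Python. To be clear, this is not Syncsort code and has nothing to do with how their engine really decides. The function names, the two strategies, and the memory-budget rule are all made up for illustration. It just shows the shape of the idea: the caller asks for sorted data, and the engine picks the strategy from the data size and the resources available.

```python
import heapq
from typing import List


def choose_sort_strategy(n_records: int, record_bytes: int,
                         memory_budget_bytes: int) -> str:
    """Hypothetical rule: sort in memory if the data fits the budget,
    otherwise fall back to a chunked sort plus merge."""
    if n_records * record_bytes <= memory_budget_bytes:
        return "in_memory"
    return "external_merge"


def sort_records(records: List[int], record_bytes: int,
                 memory_budget_bytes: int) -> List[int]:
    """Sort records; the strategy is picked at runtime, not by the caller."""
    strategy = choose_sort_strategy(len(records), record_bytes,
                                    memory_budget_bytes)
    if strategy == "in_memory":
        return sorted(records)

    # External-merge style: sort chunks small enough for the budget,
    # then do a k-way merge of the sorted runs. A real engine would
    # spill the runs to disk; this toy keeps them in a list.
    chunk_size = max(1, memory_budget_bytes // record_bytes)
    runs = [sorted(records[i:i + chunk_size])
            for i in range(0, len(records), chunk_size)]
    return list(heapq.merge(*runs))
```

Either way the caller gets the same sorted result back. Only the path taken to get there changes, which is the whole point of not specifying it at design time.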

As Syncsort moved into the Hadoop data preparation arena, the obvious choke point to fix was distributed sort and shuffle. It’s the slowest part of nearly every Hadoop job. Mainframes have been doing distributed sorts for decades, and Syncsort has been doing it better than anyone for decades. All the Syncsort engineers had to do was figure out a way to plug the Syncsort engine into the MapReduce framework. Since early MapReduce 1.x wasn’t designed to allow sort to be plugged in, or even bypassed when not needed, they dove in and contributed a bunch of code to give MapReduce that capability. MapReduce 2.x now has pluggable and bypassable sort. You’re welcome.

Now, if you design in Syncsort, you can execute locally with the Syncsort engine, or you can execute on a Hadoop cluster in the MapReduce framework with Syncsort speeding up all the sorts, shuffles, joins and aggregations. Intelligent eXecution handles things like load balancing, minimizing I/O impact, and making the best use of available CPU cycles, so that the job runs as efficiently as possible without you having to do a bunch of performance tuning.

That’s one very cool thing that Intelligent eXecution is: an automatic distributed ETL job execution optimizer.

Intelligent eXecution is also a layer that abstracts the design from the execution.

What that means is, you can design a job in DMX-h, and execute it with Syncsort’s own engine right on your laptop, or on an edge node, or a server. No Hadoop of any flavor involved.

Then, you can change a setting on the job you built, point at a Hadoop cluster with Syncsort installed on it, choose MapReduce as the execution framework, and execute there. No mappers or reducers defined at any time. No need to tune and tweak, adjusting big sides and little sides, etc. Then, when your boss gives you crap about switching to Spark, you change that setting, point at a Spark cluster, or a Hadoop cluster with Spark added, and execute again.

Boom. You just migrated your code from MapReduce to Spark without re-building anything.
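For the code-minded, here’s a minimal Python sketch of what “abstracting the design from the execution” looks like in principle. Again, this is my own illustration, not the DMX-h API. The class name, the backend names, and the `execute` function are all hypothetical, and the “partitioned” backend is just a stand-in for a cluster framework. The job says what to compute (sum values per key); the framework setting decides how.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Record = Tuple[str, int]


@dataclass(frozen=True)
class AggregateJob:
    """Logical design only: sum values per key. Nothing about how to run it."""
    name: str


def run_local(job: AggregateJob, records: List[Record]) -> Dict[str, int]:
    """Single-process execution: one pass over all records."""
    totals: Dict[str, int] = defaultdict(int)
    for key, value in records:
        totals[key] += value
    return dict(totals)


def run_partitioned(job: AggregateJob, records: List[Record],
                    partitions: int = 3) -> Dict[str, int]:
    """Toy stand-in for a cluster framework: hash-partition by key,
    aggregate each partition separately, then combine the results."""
    buckets: List[List[Record]] = [[] for _ in range(partitions)]
    for key, value in records:
        buckets[hash(key) % partitions].append((key, value))
    combined: Dict[str, int] = {}
    for bucket in buckets:
        combined.update(run_local(job, bucket))  # keys never span buckets
    return combined


# Hypothetical registry: switching frameworks is a setting, not a rewrite.
BACKENDS: Dict[str, Callable[..., Dict[str, int]]] = {
    "local": run_local,
    "partitioned": run_partitioned,
}


def execute(job: AggregateJob, records: List[Record],
            framework: str) -> Dict[str, int]:
    """Run the same job design on whichever backend is named."""
    return BACKENDS[framework](job, records)
```

Calling `execute(job, records, "local")` and `execute(job, records, "partitioned")` gives the same totals from the same job definition. Supporting a new framework means adding one entry to the registry, and the job itself never changes.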

SiliconANGLE caught on to this capability really well, and probably explains it better than I do.

The really exciting thing about this, though, isn’t just that in Syncsort DMX-h version 9.0, Intelligent eXecution now supports Spark. A couple years down the road when Flink or Heron or Storm or whatever is the next cool, fast, best framework, Syncsort plans to add that to IX. You’ll be able to grab the latest version of DMX-h, change a few settings on your jobs, and migrate again, no problem. Zero re-development work or cost, even if you end up deploying on a framework that wasn’t invented when you designed your job. That’s the idea our marketing folks call “future-proofing.” I’d call it shelter from the storm.

And no matter what framework you execute in, your jobs will all run with better performance than they would in that framework by itself, because the sorts, aggregations and joins will still be optimized by Syncsort, the expert in sorting.

Conclusion: Intelligent eXecution is a cool tech that makes Syncsort’s future look warm and bright.


Authored by Paige Roberts

Product Manager, Big Data