Is this blog about yet another TeraSort benchmark?

Benchmarking is an integral part of our life at Syncsort, since the products we develop are highly focused on performance and scalability. Having said that, I often find published benchmark results to be carefully crafted. This blog is not about a single benchmark data point over thousands of nodes. It is about a set of Hadoop benchmarks focused on TeraSort and the most common Extract-Transform-Load (ETL) use cases. While it’s great to show results on thousands of nodes, I believe it is also important to show benchmark results in a more common real-world configuration. Based on what we have seen with our customers and on the data published on the Apache Hadoop Wiki “Powered By Hadoop” page, we used a 10-node cluster for this particular test of our new products.

We are very excited about our new product releases, DMX-h Sort Edition and DMX-h ETL Edition. These products deliver high-performance, scalable data integration on Hadoop, with a focus on ease of use.

Now, let’s look at DMX-h Sort Edition first. DMX-h Sort delivers an alternative sort implementation for the MapReduce sort phase. During the MapReduce data flow, the data is sorted using Syncsort’s sort algorithms instead of the native sort. Integration of DMX-h Sort is seamless; you can configure either a particular job or all jobs running on the cluster to use DMX-h Sort when possible. We ran the TeraSort benchmark with DMX-h Sort.

In today’s blog, I will focus on the TeraSort results. The tests were run with CDH 4.2, simply because this distribution is the first to include Syncsort’s contribution to the Apache MapReduce project, MAPREDUCE-2454, in a generally available release.

Cluster configuration:

- (10 + 1 + 1) nodes with 12 cores – Intel Xeon X5670, 2.93 GHz
- Memory: 96 GB per node
- Disk drives: 12 x 3 TB 7200 RPM; I/O speed: 110 MB/sec write and 140 MB/sec read
- HDFS block size = 256 MB
- MapReduce version 1

The chart below displays the TeraSort benchmark results: the elapsed (wall-clock) time to run TeraSort with the native sort versus with DMX-h Sort. Map output compression was not enabled for this set of tests.

[Chart: TeraSort Benchmark 1]

As you can see, the percentage gain in elapsed time increases as the input data size grows. With DMX-h Sort, the reduction in elapsed time relative to the native sort grows from 35% for 0.5 TB to over 55% for about 2 TB of data. A 55% reduction in elapsed time means the job runs more than 2x faster.
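To make the arithmetic behind "more than 2x faster" explicit, here is a small Python sketch that converts a percentage reduction in elapsed time into a speedup factor. The function name is made up for this illustration, and the only inputs used are the two percentages quoted above.

```python
def speedup_from_time_reduction(reduction_pct: float) -> float:
    """Speedup factor implied by a percentage reduction in elapsed time.

    If a job takes (100 - reduction_pct)% of the original time,
    it runs 1 / (1 - reduction_pct / 100) times as fast.
    """
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction_pct must be in the range [0, 100)")
    return 1.0 / (1.0 - reduction_pct / 100.0)


# The two percentages quoted in the post.
for pct in (35, 55):
    factor = speedup_from_time_reduction(pct)
    print(f"{pct}% less elapsed time -> {factor:.2f}x faster")
# prints:
# 35% less elapsed time -> 1.54x faster
# 55% less elapsed time -> 2.22x faster
```

A 50% reduction is exactly 2x, so anything over 50% is better than twice as fast.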

Let’s look at the amount of data that is being processed per unit of time per node, i.e. Megabytes of data processed per second per node (MB/sec/node). As the data size changes, the amount of data that is processed using DMX-h Sort remains constant, whereas the amount of data processed by native sort per second per node drops.

[Chart: TeraSort Benchmark 2]
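For readers who want to compute the same MB/sec/node metric on their own cluster, a minimal Python sketch follows. The function name is hypothetical, the sketch assumes decimal units (1 MB = 10^6 bytes) and that only the 10 worker nodes are counted, and the elapsed time in the example is an invented round number for illustration, not a measured result from this benchmark.

```python
def mb_per_sec_per_node(input_bytes: int, elapsed_seconds: float, nodes: int) -> float:
    """Megabytes processed per second per node (decimal MB, an assumption)."""
    if elapsed_seconds <= 0 or nodes <= 0:
        raise ValueError("elapsed_seconds and nodes must be positive")
    return input_bytes / 1_000_000 / elapsed_seconds / nodes


TB = 10**12  # decimal terabyte, an assumption

# Hypothetical run: 1 TB of input, 1,000 seconds elapsed, 10 worker nodes.
# These numbers are illustrative only and are NOT taken from the charts.
rate = mb_per_sec_per_node(1 * TB, 1_000, 10)
print(f"{rate:.1f} MB/sec/node")  # prints: 100.0 MB/sec/node
```

If this number stays flat as the input grows, each node is doing the same amount of useful work per second regardless of data size; if it drops, the same cluster is being used less efficiently on larger inputs.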

What does this mean?  Basically, you can process more data per node by using DMX-h Sort, without adding more nodes to the cluster, because DMX-h helps to scale within each node. Processing more data per second per node implies cost savings for both CAPEX and OPEX; you don’t have to increase your cluster size every time there is a significant jump in data volume.  If your cluster is deployed in the cloud, this implies less usage and more cost savings.

In this particular test, we are setting a baseline with the standard TeraSort benchmark to demonstrate the benefits of using DMX-h Sort versus native sort. In Part II of this blog, we will focus on typical use cases for ETL, web log aggregation and Change Data Capture (CDC). Stay tuned…

 


Today, we here at Syncsort announced two new Hadoop products: DMX-h Sort and DMX-h ETL.  I don’t want to repeat the announcement here; I’d rather talk about why I think this is unique.  When we started planning these new releases, we set our goal to: “make the hard, easy; and the impossible, possible.”

We consistently hear from our customers and partners that ETL (some call it data refinement, data collection & preparation, etc.) is the #1 use case for Hadoop.  But if you think about typical ETL, some problems are very hard to solve with Hadoop because of the distributed nature of the data.  Join, for instance, is one of them, particularly when both sides of the Join are large.

This was a challenge that we embraced.  We not only wanted to help our customers and partners solve Hadoop issues, we wanted to make it easy to do and significantly increase performance at the same time.  Part of our announcement today is “smarter productivity” with use case accelerators for common use cases, such as Join.  These are pre-built job templates complete with documentation.  You simply fill in the metadata, and then you can execute.  Or you can try it out with the samples we include.

Another common hard (impossible?) problem is Change Data Capture: the process of taking two data sets (the current data set and the previous data set) and identifying the changed data.  The changed records need to be flagged as New, Updated, or Deleted.

This can be very difficult in MapReduce when both data sets are large.  You could literally write hundreds of lines of Java code, or use DMX-h ETL. The approach is very straightforward using our GUI.

So, why is this hard in MapReduce?  Well, it’s not a difficult concept, but there are difficulties in implementation.  Think about what needs to be done on the Map side before you can identify the changed records on the Reduce side.  The Map side needs to bring the data together in each Mapper, while keeping track of whether each record comes from the current or the previous data set.  The data then needs to be sorted before it is sent to the Reducers.  And don’t forget, you’re going to 1) need to send all records with the same key (such as customer ID) to the same Reducer, and 2) need to avoid hard-coding the number of Reducers, because you’re going to want this to be dynamic based on a number of factors, particularly data volume at execution time.

On the Reduce side, now that you have all of the like records based on some key (customer ID) in the same Reducer, you need to re-split the data into the current and previous data sets and then perform a full outer join.  Records that appear in the current version but not the previous version are inserts, and those that appear in the previous version but not the current one are deletes. For records that appear in both, the non-primary-key fields are compared, and any records where they are not identical are updates.  Each changed record needs to be flagged with an I, D, or U.
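As a rough illustration of the Map-side and Reduce-side steps described above, here is a minimal single-process Python sketch. It is not DMX-h ETL and it is not Hadoop code; it only simulates the tag, partition, group, and full-outer-join logic in memory. The function names, the `choose_num_reducers` heuristic, and the sample customer records are all assumptions made up for this example, and the sketch assumes at most one record per key in each data set.

```python
import math
from collections import defaultdict
from zlib import crc32

CURRENT, PREVIOUS = "C", "P"
_MISSING = object()  # marks "no record for this key on this side"


def choose_num_reducers(total_records, records_per_reducer=2):
    """Hypothetical heuristic: size the reducer count from data volume at run time."""
    return max(1, math.ceil(total_records / records_per_reducer))


def map_phase(current, previous):
    """Emit (key, (source_tag, fields)) so the reducer knows each record's origin."""
    for key, fields in current:
        yield key, (CURRENT, fields)
    for key, fields in previous:
        yield key, (PREVIOUS, fields)


def partition(key, num_reducers):
    """Deterministic hash so all records with the same key reach the same reducer."""
    return crc32(str(key).encode("utf-8")) % num_reducers


def shuffle(mapped, num_reducers):
    """Stand-in for the sort/shuffle: group tagged records per reducer, per key."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for key, tagged in mapped:
        reducers[partition(key, num_reducers)][key].append(tagged)
    return reducers


def reduce_phase(grouped):
    """Full outer join of current vs. previous per key; flag I, D, or U."""
    for key in sorted(grouped):
        cur = prev = _MISSING
        for tag, fields in grouped[key]:  # re-split by source data set
            if tag == CURRENT:
                cur = fields
            else:
                prev = fields
        if prev is _MISSING:
            yield "I", key, cur    # only in current: insert
        elif cur is _MISSING:
            yield "D", key, prev   # only in previous: delete
        elif cur != prev:
            yield "U", key, cur    # in both, non-key fields differ: update
        # identical records are unchanged and are not emitted


def change_data_capture(current, previous, num_reducers):
    changes = []
    for grouped in shuffle(map_phase(current, previous), num_reducers):
        changes.extend(reduce_phase(grouped))
    return sorted(changes)


# Made-up sample data: (customer_id, (name, state))
previous = [(1, ("Ann", "NY")), (2, ("Bob", "LA")), (3, ("Cy", "SF"))]
current = [(1, ("Ann", "NY")), (2, ("Bob", "TX")), (4, ("Di", "DC"))]

n = choose_num_reducers(len(previous) + len(current))
changes = change_data_capture(current, previous, n)
for flag, key, fields in changes:
    print(flag, key, fields)
```

With this sample data, customer 2 is flagged U, customer 3 is flagged D, customer 4 is flagged I, and the unchanged customer 1 is not emitted. Because the partition function depends only on the key, the result is the same for any reducer count, which is what makes it safe to choose that count at execution time.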

You can see a video of this use case here. And you can even try this out…I’ll talk about our DMX-h Pre-Release Test Drive in a blog post coming next week.

We actually have about a dozen use case accelerators right now, and we will continue to add more as we work with our users.

Information about how we natively integrate into Hadoop MapReduce is available in the announcement along with some initial performance & scalability results.  We will write some blog posts over the coming weeks about the performance benefits using DMX-h.

These are not only hard problems to solve in Hadoop but, we’ve been told, impossible to solve with other approaches (i.e., other tools) on Hadoop.  Making the hard, easy; and the impossible, possible…now that’s cool!

 


In the case of Syncsort, quite a bit!  While, many times, a new website represents little more than some lipstick applied to a tired company, for a few great organizations, it can represent a rebirth.  And so it is for Syncsort today.  As a 40-plus-year-old technology organization, we have seen quite a bit of change in the software sector.  And we are very proud to be that rare organization that has successfully evolved its business model from the mainframe era to the age of big data.

So what does our website say about us?  Bold colors tell you we are a bold company that is on the cutting edge of the data integration and data protection businesses.  Our relaxed look and feel tells you we are easy to work with, but a company with serious solutions.  The new functionality we offer, such as downloadable test drives, multiple “how to” videos and amazing educational content all tell customers and prospects that Syncsort has Smarter software for superior data solutions.

So today is a big day for Syncsort…our coming out.  We are proud of our new site not only because it provides enhanced capabilities and a better user experience, but because it makes a statement about Syncsort.  It highlights our successful history, ground-breaking current products that are solving some of the most pressing data problems, and our commitment to future offerings that promise to deliver the type of innovation that has allowed Syncsort to thrive.  So check out www.Syncsort.com and witness our rebirth along with all that this new site says about our company!


If somebody asked you to do the exact same work over and over again, would you think that was a smart thing to do? Of course not. But that’s exactly what many of us are doing in our backup environments.

There are a lot of technology approaches to backup, and all of them have to deal with ever increasing amounts of data.  But they are not all equally smart. In fact, when you look at them a certain way they can be downright stupid. And while “Dumb and Dumber” may have been quite popular as a movie, it shouldn’t serve as an approach to backup.

To read my full blog on this topic, please go to Computerworld at bit.ly/13f3pcJ

 
