ETL

I heard a collective gasp from the audience, and breathed a small sigh of relief, when I walked through and ran live two graphical MapReduce jobs that replaced 988 lines of SQL. The run time on the Hadoop cluster I was connected to was a mere 12 minutes, instead of the 2 hours it takes in production on a very large Oracle instance. I was genuinely surprised by how many people came up to me after the meetup and commented on how nice it was to see something running live. I guess they see a lot of PowerPoint at meetups, but I think it's important to show the real thing.

The use case came from a recent POC I had completed with Syncsort's DMX-h product, processing 800 million rows of pharmaceutical/retail data through very CPU-intensive ETL steps: two joins and one group-by aggregation. In other words, a perfect candidate for DMX-h on Hadoop. The current production job runs in Oracle and is mission-critical to the business, feeding improved and more targeted customer marketing campaigns. What makes it an even better fit is the month-over-month growth of the data and the sheer number of similar jobs that bog down the Enterprise Data Warehouse (EDW).

In effect, Hadoop and DMX-h were used to demonstrate how easy it is to offload such jobs to MapReduce. OK, I know you're probably waiting for the gory details of run times before versus after the offload. Oracle: 2 hours and 40 minutes, after many months of tuning the code and the underlying infrastructure. DMX-h MapReduce: 12 minutes with no tuning, completely out of the box. I was thrilled, and proud of what I had just proved: not only the run time, but the ease with which I was able to build the simple point-and-click MapReduce job in DMX-h in a day.


I did NOT want to stop there. I wanted to keep going, so I ran the same job with 2X and 5X the amount of data to prove the much-talked-about scalability of Hadoop and DMX-h. The results were spectacular: 21 minutes and 51 minutes respectively (as you will see if you click the link to the YouTube video at the end of this blog). Did we test 2X the volume in Oracle? Yes, indeed, and it ran for 13 hours after tuning, further proving that offloading resource-intensive processes from your EDW (in this case Oracle, but it could be any RDBMS) can pay huge dividends. Think about all the EDW resources you will free up as a result, and all the data you can now process with no effort!

And I left out a huge part of this: the load to HDFS. It took only 4 minutes to load the 1X production data volume into HDFS, and only 7 minutes for 2X. Guess how long it took to load the 3 Oracle source tables with 2X the volume… 20 hours, due to the highly indexed table structures and an extremely busy data warehouse.

So now you know exactly why there was a collective gasp from the audience.

In conclusion, I would suggest a few things:

- DO NOT be conservative when planning your data warehouse offload to Hadoop: DMX-h will make it seamless

- DO NOT worry about converting data warehouse SQL to DMX-h

- DO NOT worry about Java expertise or code generation: my ETL product background was all I needed!

To view my presentation on YouTube, click here.


A good friend has recently taken up beekeeping and has been posting pictures and videos online (together with a live webcam from the hive). One video he posted the other day showed the queen bee's interaction with the other bees in the hive, something I was aware of but had never seen. It reminded me of a recent interview with a journalist: despite my sharing several anecdotes about useful discoveries made thanks to Big Data, the one that made it into the article was about queen bee analysis. It involves using Big Data technologies like Hadoop to analyze information and spot people who have a disproportionate impact on the users around them.

The first time I came across this kind of analysis was while working with a telco. The chief architect had noticed some interesting things about his teenage daughter's interactions with her friends. She was on the network's lowest-cost plan with very few talk minutes, but it included unlimited SMS and text messaging, which she used extensively. She had also created a friends group so that, with a few clicks, anything she wanted to share could be forwarded to a large distribution list. The company viewed her as a low-value customer given her limited revenue, but when her dad got her a new handset and she started updating her friends about its features, many of them upgraded (some even switching networks to do it).

Recognizing these users requires combining a large number of different indicators, often housed in a variety of different systems. It's not enough to find users who send a lot of messages, since the highest volumes often come from spammers; you need to find users whose interactions prompt a response from other users. Interactions can also switch channels: an SMS message may lead to a Facebook post or tweet, which could then trigger a phone call.
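For the technically minded, here is a minimal sketch of what a response-based influence score might look like once the raw interaction data has been pulled together, written as plain Java rather than a MapReduce job to keep it short. Everything here is illustrative: the Interaction record, its field names, and the assumption that responses have already been correlated to the original message upstream (for example, within a time window) are all hypothetical.

    import java.time.Instant;
    import java.util.*;

    public class QueenBeeScore {

        /** One cross-channel interaction event. Field names are hypothetical; respondsTo
         *  links a response back to the original sender and is assumed to have been
         *  correlated upstream (e.g., within a time window), regardless of channel. */
        record Interaction(String userId, String respondsTo, String channel, Instant timestamp) {}

        /**
         * Scores each sender by distinct responders per message sent. High-volume senders
         * who draw few responses (spammers) score low; "queen bees" score high.
         */
        static Map<String, Double> influenceScores(List<Interaction> events) {
            Map<String, Integer> messagesSent = new HashMap<>();
            Map<String, Set<String>> distinctResponders = new HashMap<>();

            for (Interaction e : events) {
                messagesSent.merge(e.userId(), 1, Integer::sum);
                if (e.respondsTo() != null) {
                    distinctResponders.computeIfAbsent(e.respondsTo(), k -> new HashSet<>())
                                      .add(e.userId());
                }
            }

            Map<String, Double> scores = new HashMap<>();
            messagesSent.forEach((user, sent) ->
                scores.put(user, (double) distinctResponders.getOrDefault(user, Set.of()).size() / sent));
            return scores;
        }
    }

The key design point is that the score is driven by responses from distinct other users, not raw message volume, which is what separates an influential user from a spammer.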

The interesting thing about queen bee analysis is that it isn't constrained to a single vertical. The same interactions that occur in telecommunications are relevant in retail, financial services, life sciences and other industries, and can identify "queen bees" who could have a dramatic impact, both positive and negative, on a company's bottom line. This is clearly an interesting topic, so it might warrant a more detailed explanation or comments from one of our internal experts. Please reach out and let me know if you'd like to know more.

The article that includes my comments is here: http://bit.ly/13FmaYv.


 


Is this blog about yet another TeraSort benchmark?

Benchmarking is an integral part of our life at Syncsort, since the products we develop are highly focused on performance and scalability. Having said that, I often find published benchmark results carefully crafted to show a product in the best possible light. This blog is not about a single benchmark data point over thousands of nodes; it is about a set of Hadoop benchmarks focused on TeraSort and the most common Extract-Transform-Load (ETL) use cases. While it's great to show results on thousands of nodes, I believe it is also important to show benchmark results on a more common, real-world configuration. Based on what we have seen with our customers and on the data published on the Apache Hadoop wiki's "Powered By Hadoop" page, we used a 10-node cluster for this particular test of our new products.

We are very excited about our new product releases, DMX-h Sort Edition and DMX-h ETL Edition. These products deliver high-performance, scalable data integration on Hadoop with ease of use.

Let's look at DMX-h Sort Edition first. DMX-h Sort delivers an alternative sort implementation for the MapReduce sort phase: during the MapReduce data flow, the data is sorted using Syncsort's sort algorithms instead of the native sort. Integration of DMX-h Sort is seamless; you can configure either a particular job, or all jobs running on the cluster, to use DMX-h Sort when possible. We ran the TeraSort benchmark with DMX-h Sort.
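For readers who want to peek under the hood, the pluggable-sort work in Apache Hadoop exposes the map output collector and shuffle consumer as configurable classes. The sketch below shows how a job might opt in; it is an assumption-laden illustration, not Syncsort's actual integration: the plugin class names are made up, and the exact property names and their availability can vary by Hadoop version and distribution.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class PluggableSortExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // The pluggable-sort work makes the map output collector (sort) and the shuffle
            // consumer replaceable. Property names follow the Apache Hadoop 2.x defaults and
            // may differ by distribution; the plugin class names below are purely hypothetical.
            conf.set("mapreduce.job.map.output.collector.class",
                     "com.example.ExternalSortCollector");       // hypothetical plugin class
            conf.set("mapreduce.job.reduce.shuffle.consumer.plugin.class",
                     "com.example.ExternalShuffleConsumer");     // hypothetical plugin class

            Job job = Job.getInstance(conf, "terasort-with-pluggable-sort");
            // ... configure mapper, reducer, input and output as usual,
            // then submit with job.waitForCompletion(true);
        }
    }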

In today's blog, I will focus on the TeraSort results. The tests were run with CDH 4.2, simply because this distribution is the first to include Syncsort's contribution to the Apache MapReduce project, MAPREDUCE-2454, in a generally available release.

Cluster configuration:

- (10 + 1 + 1) nodes, 12 cores each: Intel Xeon X5670, 2.93 GHz

- Memory: 96 GB per node

- Disk drives: 12 x 3 TB, 7200 RPM; I/O speed: 110 MB/sec write, 140 MB/sec read

- HDFS block size: 256 MB

- MapReduce version 1

The chart below displays the TeraSort benchmark results: elapsed time (wall-clock time) to run TeraSort with the native sort versus with DMX-h Sort. Map output compression is not enabled for this set of tests.

[Chart: TeraSort Benchmark 1 (elapsed time, native sort vs. DMX-h Sort)]

As you can see, the percentage gain in elapsed time increases as the input data size grows. With DMX-h Sort, the performance gain over the native sort scales from 35% at 0.5 TB to over 55% at about 2 TB of data, i.e. more than 2x faster.

Now let's look at the amount of data being processed per unit of time per node, i.e. megabytes of data processed per second per node (MB/sec/node). As the data size grows, the amount of data processed per second per node with DMX-h Sort remains constant, whereas with the native sort it drops.

[Chart: TeraSort Benchmark 2 (MB/sec/node, native sort vs. DMX-h Sort)]

What does this mean? Basically, you can process more data per node with DMX-h Sort, without adding more nodes to the cluster, because DMX-h helps you scale within each node. Processing more data per second per node means cost savings in both CAPEX and OPEX: you don't have to grow your cluster every time there is a significant jump in data volume. If your cluster is deployed in the cloud, it means less usage and more cost savings.
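To make the metric concrete, MB/sec/node is simply the data volume divided by the elapsed time and the node count. Here is a small sketch, using made-up numbers that are not taken from the benchmark:

    public class ThroughputPerNode {
        /** MB processed per second per node: totalMB / elapsedSeconds / nodeCount. */
        static double mbPerSecPerNode(double totalMB, double elapsedSeconds, int nodeCount) {
            return totalMB / elapsedSeconds / nodeCount;
        }

        public static void main(String[] args) {
            // Illustrative numbers only: 0.5 TB sorted in 10 minutes on a 10-node cluster.
            double totalMB = 0.5 * 1000 * 1000;   // 0.5 TB expressed in MB (decimal units)
            double seconds = 10 * 60;
            System.out.println(mbPerSecPerNode(totalMB, seconds, 10) + " MB/sec/node"); // ~83.3
        }
    }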

In this particular test, we are setting a baseline with the standard TeraSort benchmark to demonstrate the benefits of using DMX-h Sort versus native sort. In Part II of this blog, we will focus on typical use cases for ETL, web log aggregation and Change Data Capture (CDC). Stay tuned…

 


Today, we at Syncsort announced two new Hadoop products: DMX-h Sort and DMX-h ETL. I don't want to repeat the announcement here; I'd rather talk about why I think this is unique. When we started planning these new releases, we set our goal: "make the hard, easy; and the impossible, possible."

We consistently hear from our customers and partners that ETL (some call it data refinement, data collection and preparation, etc.) is the #1 use case for Hadoop. But if you think about typical ETL, some problems are very hard given Hadoop and the distributed nature of its data. Joins, for instance, particularly when both sides of the join are large, are one of them.

This was a challenge we embraced. We not only wanted to help our customers and partners solve Hadoop issues, we wanted to make them easy to solve and significantly increase performance at the same time. Part of today's announcement is "smarter productivity": use case accelerators for common use cases, such as joins. These are pre-built job templates complete with documentation. You simply fill in the metadata and execute, or try them out with the samples we include.

Another common hard (impossible?) problem is Change Data Capture: the process of taking two data sets, the current one and the previous one, and identifying the changed data. The changed records need to be flagged as New, Updated, or Deleted.

This can be very difficult in MapReduce when both data sets are large. You could literally write hundreds of lines of Java code, or use DMX-h ETL, where the approach is very straightforward using our GUI.

So, why is this hard in MapReduce? It's not a difficult concept, but there are difficulties in implementation. Think about what needs to be done on the Map side before you can identify the changed records on the Reduce side. Each Mapper needs to bring the data together while keeping track of whether each record comes from the current or the previous data set, and the data must be sorted before it is sent to the Reducers. And don't forget: 1) all records with the same key (such as customer ID) must land on the same reducer, and 2) you should not hard-code the number of reducers, because you'll want it to be dynamic based on a number of factors, particularly the data volume at execution time. A hand-rolled mapper for this step is sketched below.
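To make the Map side concrete, here is a minimal hand-coded mapper sketch. It is only an illustration of the plumbing (not DMX-h's implementation): it tags each record with its source data set so the reducer can tell current from previous, and keys the output by the business key so the shuffle groups matching records together. The path test and the delimiter below are hypothetical.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    /** Tags each record with its source ("CUR" or "PREV") and emits the business key. */
    public class CdcMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Which input file did this split come from? (Path naming here is hypothetical.)
            String path = ((FileSplit) context.getInputSplit()).getPath().toString();
            String tag = path.contains("current") ? "CUR" : "PREV";

            // Assume a delimited record whose first field is the business key (e.g., customer ID).
            String[] fields = line.toString().split("\\|", 2);
            String customerId = fields[0];

            // Key by customer ID so all versions of the same record meet in one reducer;
            // prefix the value with the source tag so the reducer can re-split the two sets.
            context.write(new Text(customerId), new Text(tag + "\t" + line));
        }
    }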

On the Reduce side, now that all records sharing a key (customer ID) are in the same Reducer, you need to re-split the data into the current and previous data sets and then perform a full outer join. Records that appear in the current version but not the previous one are inserts, and those that appear in the previous version but not the current one are deletes. For records that appear in both, the non-primary-key fields are compared, and any that are not identical are updates. Each changed record is flagged with an I, D, or U; a matching reducer sketch follows.
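Here is the matching hand-coded reducer sketch, again purely illustrative: for each key it re-splits the tagged values into the current and previous sides, then flags inserts, deletes, and updates.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    /** Compares current vs. previous records per key and emits I (insert), D (delete), or U (update). */
    public class CdcReducer extends Reducer<Text, Text, Text, Text> {

        @Override
        protected void reduce(Text customerId, Iterable<Text> taggedRecords, Context context)
                throws IOException, InterruptedException {
            List<String> current = new ArrayList<>();
            List<String> previous = new ArrayList<>();

            // Re-split the values by the source tag added on the map side.
            for (Text value : taggedRecords) {
                String[] parts = value.toString().split("\t", 2);
                if ("CUR".equals(parts[0])) current.add(parts[1]); else previous.add(parts[1]);
            }

            if (previous.isEmpty()) {
                for (String rec : current) context.write(new Text("I"), new Text(rec));   // new record
            } else if (current.isEmpty()) {
                for (String rec : previous) context.write(new Text("D"), new Text(rec));  // deleted record
            } else if (!current.equals(previous)) {
                for (String rec : current) context.write(new Text("U"), new Text(rec));   // changed fields
            }
            // Records present and identical in both data sets are unchanged and not emitted.
        }
    }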

You can see a video of this use case here. And you can even try it out yourself; I'll talk about our DMX-h Pre-Release Test Drive in a blog post next week.

We currently have about a dozen use case accelerators, and we will continue to add more as we hear from our users.

Information about how we natively integrate with Hadoop MapReduce is available in the announcement, along with some initial performance and scalability results. We will write blog posts over the coming weeks about the performance benefits of using DMX-h.

These are not only hard problems to solve in Hadoop; we've been told they are impossible to solve with other approaches (i.e., other tools) on Hadoop. Making the hard, easy; and the impossible, possible… now that's cool!

 
