Data Integration

Hadoop ETL – Offloading the Enterprise Data Warehouse

Hadoop is quickly becoming the operating system for Big Data, a platform that provides powerful services for developers and vendors to create Big Data applications. As this operating system becomes the new standard for Big Data, new and exciting use cases for Hadoop continue to emerge. Therefore, I’m happy to start our new series “Most Popular Use Cases for Hadoop” where we will look at some of these use cases, their motivations and potential benefits. And why not start with what we know best: Hadoop ETL?

Over ten years ago, ETL tools promised a simple approach to extract data from multiple sources, transform it into valuable information and load it into a common repository (the enterprise data warehouse) where business users could leverage it for competitive insights. Over time, however, ETL tools were overwhelmed by the accelerating volume, velocity and variety of data. As IT organizations realized their ETL tools couldn’t scale, they tried to keep up by adding more hardware, increasing database capacity and pushing the transformations, the “T” in ETL, into the data warehouse. Unfortunately, the data warehouse is not the best place to do this type of work (you can see why in one of the all-time favorite blog posts, ETL vs. ELT). Nevertheless, SQL became one of the most popular ETL tools. But that is soon to change. The truth is, after struggling for years to implement and scale conventional ETL tools, many organizations are now looking at Hadoop to collect, process and distribute more data than ever before at a disruptive cost.

So where to begin? Well, organizations can start by identifying some of the heaviest data transformations occurring in their data warehouse environments. Typically, 20% of the transformations consume up to 80% of database capacity. Then, they can shift those transformations out of the data warehouse and into Hadoop. This approach will allow them to realize significant benefits very quickly, including shortened ETL batch windows, faster database user queries, and significant operational savings in the form of spare database capacity.
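To make the “heaviest transformations first” idea concrete, here is a minimal sketch of how you might shortlist offload candidates. It assumes you can export some per-job cost figure from your warehouse (CPU seconds here); the job names, the numbers and the `pick_offload_candidates` helper are all hypothetical, for illustration only.

```python
def pick_offload_candidates(cost_by_job, target_share=0.8):
    """Return the heaviest jobs that together reach target_share of total cost.

    cost_by_job maps a job name to any additive cost metric (assumed here to be
    CPU seconds). The result is (job names, share of total cost they cover).
    """
    total = sum(cost_by_job.values())
    if total == 0:
        return [], 0.0

    # Walk the jobs from most to least expensive until the target is covered.
    ranked = sorted(cost_by_job.items(), key=lambda kv: kv[1], reverse=True)
    picked, running = [], 0.0
    for name, cost in ranked:
        if running >= target_share * total:
            break
        picked.append(name)
        running += cost
    return picked, running / total


# Hypothetical workload figures, not taken from any real warehouse.
workload = {
    "sessionize_clickstream": 5200,
    "customer_dedup_join": 2900,
    "daily_sales_rollup": 600,
    "currency_lookup": 450,
    "load_dim_store": 350,
    "load_dim_product": 300,
    "audit_row_counts": 200,
}

candidates, share = pick_offload_candidates(workload)
print(candidates, round(share, 2))
```

In this made-up workload, two of the seven jobs cover just over 80% of the cost, which is the kind of skew that makes a targeted offload worthwhile.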

However, when you do that, keep in mind Hadoop is not a complete ETL solution. Failing to recognize the gaps between the operating level services Hadoop provides and the functionality that users expect when deploying Enterprise ETL can create frustration and hamper the benefits of Hadoop… and that’s exactly where DMX-h ETL Edition comes into play, making sure you have everything you need to unleash the power of Hadoop and deploy a smarter approach to ETL. After all, we’ve been helping organizations offload the “T” from the data warehouse for years!

So, what challenges are you facing when deploying ETL in Hadoop? What is prompting you to offload the “T” from your data warehouse?


A good friend has recently embraced beekeeping and has been posting pictures and videos online (together with a live webcam feed from the hive). One of the videos he posted the other day showed the queen bee’s interaction with the other bees in the hive (which I was aware of but had never seen). It reminded me of a recent interview with a journalist: although I shared several anecdotes about useful discoveries made thanks to Big Data, the one that made it into the article was the one about queen bee analysis. It involves analyzing information with Big Data technologies like Hadoop to spot people who have a disproportionate impact on the users around them.

The first time I came across this kind of analysis was when working with a telco. The chief architect had noticed some interesting things about his teenage daughter’s interactions with her friends. Even though she was on the lowest cost plan with the network and had very few talk minutes, it included unlimited SMS and text messaging and she was using them extensively. She had also created a friends group where, with a few clicks, anything she wanted to share could be forwarded to a large distribution list. The company viewed her as a low-value customer given her limited revenue, but when her dad got her a new handset and she started updating her friends about the new features, many of them upgraded (some even switching networks to do it).

Recognizing these users requires combining a large number of different indicators, often housed in a variety of different systems. It’s not enough just to find users that send a lot of messages, as the highest volumes often come from spammers. Instead, you need to find users that interact with other users and create a response. Interactions can also switch channels: for example, an SMS message may result in a Facebook post or tweet, which could then cause a phone call.
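As a rough illustration of “interaction that creates a response” versus raw message volume, the sketch below scores each sender by how many distinct recipients go on to do something (on any channel) shortly after hearing from them. The event format, the one-hour window and the `influence_scores` helper are assumptions made up for this example; a real analysis would combine many more indicators and run at far larger scale.

```python
from bisect import bisect_right
from collections import defaultdict


def influence_scores(events, window_seconds=3600):
    """Score senders by distinct recipients who act soon after being contacted.

    events: iterable of (timestamp_seconds, sender, recipient, channel).
    A recipient counts as a responder if they originate any event, on any
    channel, strictly after the message and within window_seconds of it.
    """
    events = list(events)

    # Sorted timestamps of everything each user originated.
    outgoing = defaultdict(list)
    for ts, sender, _recipient, _channel in events:
        outgoing[sender].append(ts)
    for times in outgoing.values():
        times.sort()

    sent = defaultdict(int)
    responders = defaultdict(set)
    for ts, sender, recipient, _channel in events:
        sent[sender] += 1
        times = outgoing.get(recipient, [])
        i = bisect_right(times, ts)  # first activity strictly after ts
        if i < len(times) and times[i] <= ts + window_seconds:
            responders[sender].add(recipient)

    return {
        user: {"sent": sent[user], "responders": len(responders[user])}
        for user in sent
    }


# Hypothetical events: a high-volume spammer and a low-volume "queen bee".
events = [
    (0, "spammer", "u1", "sms"),
    (1, "spammer", "u2", "sms"),
    (2, "spammer", "u3", "sms"),
    (3, "spammer", "u4", "sms"),
    (10, "queen", "a", "sms"),
    (11, "queen", "b", "sms"),
    (60, "a", "c", "facebook"),   # a reacts on a different channel
    (90, "b", "queen", "call"),   # b reacts with a phone call
]

scores = influence_scores(events)
print(scores["spammer"], scores["queen"])
```

Here the spammer sends twice as many messages as the queen bee but triggers no follow-on activity, so volume alone would have ranked the two the wrong way round.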

The interesting thing about queen bee analysis is that it’s not constrained to a single vertical. The same interactions that occur in telecommunications are relevant in retail, financial services, life sciences, etc., and the same analysis can identify “queen bees” that could have a dramatic impact, both positive and negative, on a company’s bottom line. This is clearly an interesting topic, so it might warrant some more detailed explanation or comments from one of our internal experts. Please reach out to me and let me know if you’d like to know more.

The article that mentions this analysis is located here: http://bit.ly/13FmaYv.


Last Monday, we announced two new DMX-h Hadoop products, DMX-h Sort Edition and DMX-h ETL Edition. Several blog posts last week covered why I thought the announcement was cool, along with some Hadoop benchmarks for both TeraSort and ETL workloads.

Part of our announcement was the DMX-h ETL Pre-Release Test Drive. The test drive is a trial download of our DMX-h ETL software. We have installed our software on our partner Cloudera’s VM (VMware) image, complete with the use case accelerators, sample data, documentation and even videos. While the download is a little large (ok, it’s over 3GB), it’s a complete VM with Linux and Cloudera’s CDH 4.2 Hadoop release (the DMX-h footprint is a mere 165MB!).


The use case accelerators allow users to get going quickly, not only with DMX-h ETL but also with the test drive itself. We’ve included the use cases that we hear about consistently: identifying changes between two different data sets (change data capture), aggregating web log data, translating and loading mainframe data into HDFS, and more.
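For readers new to change data capture, the core idea of comparing two data sets fits in a few lines. This is only a minimal in-memory sketch, assuming each snapshot is keyed by a primary key; the `detect_changes` helper and the sample records are hypothetical and say nothing about how the DMX-h accelerator itself is implemented.

```python
def detect_changes(previous, current):
    """Compare two snapshots keyed by primary key.

    Returns (inserted, updated, deleted), where updated maps each changed key
    to an (old_record, new_record) pair.
    """
    inserted = {k: current[k] for k in current.keys() - previous.keys()}
    deleted = {k: previous[k] for k in previous.keys() - current.keys()}
    updated = {
        k: (previous[k], current[k])
        for k in previous.keys() & current.keys()
        if previous[k] != current[k]
    }
    return inserted, updated, deleted


# Hypothetical snapshots of a customer table: key -> (name, city).
yesterday = {
    101: ("Ana", "Lisbon"),
    102: ("Ben", "Leeds"),
    103: ("Chen", "Austin"),
}
today = {
    101: ("Ana", "Porto"),     # changed city
    103: ("Chen", "Austin"),   # unchanged
    104: ("Dee", "Oslo"),      # new record
}

inserted, updated, deleted = detect_changes(yesterday, today)
print(inserted, updated, deleted)
```

On a cluster the same comparison is usually expressed as a join on the key across the two data sets, since neither snapshot is expected to fit in memory.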

So that you can actually use the use case accelerators, we have included sample data, comprehensive documentation and even videos.

The test drive is not your normal download.  This is actually a pre-release of our DMX-h ETL product offering.  While we have announced our product, it is not generally available (GA) yet…scheduled for end of June.  We are offering a download of a product that isn’t even available yet…how many vendors do that?!

The test drive is not only about our users’ experience. While this is the primary focus, we also want to hear back so our users can influence our product. We are using our community as a place to ask questions and for users to provide feedback on their test drive.

The DMX-h ETL Pre-Release Test Drive has been live for just over a week, and we have had literally hundreds of downloads.  Join the crowd and take the test drive.


Is this blog about yet another TeraSort benchmark?

Benchmarking is an integral part of our life at Syncsort since the products we develop are highly focused on performance and scalability. Having said that, I often find published benchmark results well crafted. This blog is not about a single benchmark data point over thousands of nodes. It is about a set of Hadoop benchmarks focused on TeraSort and the most common Extract-Transform-Load (ETL) use cases. While it’s great to show results on thousands of nodes, I believe it is also important to show benchmark results in a more common real-world configuration. Based on what we have seen with our customers and on the data published on the Apache Hadoop Wiki “Powered By Hadoop” page, we used a 10 node cluster for this particular test of our new products.

We are very excited about our new product releases, DMX-h Sort Edition and DMX-h ETL Edition. These products deliver high performance and scalable data integration with ease-of-use on Hadoop.

Now, let’s look at DMX-h Sort Edition first. DMX-h Sort delivers an alternative sort implementation for the MapReduce sort phase. During the MapReduce data flow, the data is sorted using Syncsort’s sort algorithms instead of the native sort. Integration of DMX-h Sort is seamless; you can configure either a particular job or all jobs running on the cluster to use DMX-h Sort when possible. We ran the TeraSort benchmark with DMX-h Sort.

In today’s blog, I will focus on the TeraSort results. The tests are run with CDH 4.2, simply because this distribution is the first to include Syncsort’s contribution to the Apache MapReduce project, MAPREDUCE-2454, in a generally available release.

Cluster configuration:

- (10 + 1 + 1) nodes with 12 cores: Intel Xeon X5670, 2.93 GHz
- Memory: 96 GB per node
- Disk drives: 12 x 3 TB 7200 RPM; I/O speed: 110 MB/sec write and 140 MB/sec read
- HDFS block size: 256 MB
- MapReduce version 1

The chart below displays the TeraSort benchmark results, elapsed time (clock time) to run TeraSort with native sort versus with DMX-h Sort. Map output compression is not enabled for this set of tests.

Terasort Benchmark 1

As you can see, the percentage reduction in elapsed time increases as the input data size grows. With DMX-h Sort, the performance gain over the native sort grows from 35% for 0.5 TB to over 55% for about 2 TB of data, i.e. more than 2x faster.
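If the jump from “55% less elapsed time” to “more than 2x faster” isn’t obvious, the conversion is speedup = 1 / (1 - reduction). A quick sketch (the `speedup_from_reduction` helper name is mine, not part of any product):

```python
def speedup_from_reduction(reduction):
    """Convert a fractional reduction in elapsed time into a speedup factor.

    If a job takes (1 - reduction) of its original time, it runs
    1 / (1 - reduction) times as fast.
    """
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in the range [0, 1)")
    return 1.0 / (1.0 - reduction)


# The two gains quoted above, plus the break-even point for "2x faster".
for reduction in (0.35, 0.50, 0.55):
    print(f"{reduction:.0%} less elapsed time = {speedup_from_reduction(reduction):.2f}x")
```

A 50% reduction is exactly 2x, so anything over 55% is comfortably past that mark, while the 35% gain at 0.5 TB works out to roughly 1.5x.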

Let’s look at the amount of data that is being processed per unit of time per node, i.e. Megabytes of data processed per second per node (MB/sec/node). As the data size changes, the amount of data that is processed using DMX-h Sort remains constant, whereas the amount of data processed by native sort per second per node drops.

Terasort Benchmark 2
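The per-node throughput metric described above is simple to compute for your own runs. The sketch below assumes decimal units (1 TB = 1,000,000 MB) and counts only the worker nodes; the elapsed time in the example is a made-up placeholder, not a figure from our benchmark.

```python
MB_PER_TB = 1_000_000  # decimal units assumed


def mb_per_sec_per_node(data_tb, elapsed_seconds, worker_nodes):
    """Megabytes processed per second per node for a single run."""
    if elapsed_seconds <= 0 or worker_nodes <= 0:
        raise ValueError("elapsed_seconds and worker_nodes must be positive")
    return data_tb * MB_PER_TB / elapsed_seconds / worker_nodes


# Hypothetical run: 1 TB sorted in 1,000 seconds on 10 worker nodes.
print(mb_per_sec_per_node(1.0, 1000, 10))
```

If this number holds steady as the input grows, the cluster is scaling with the data; if it falls, each extra terabyte costs more than the last.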

What does this mean? Basically, you can process more data per node by using DMX-h Sort, and this is without adding more nodes to the cluster, as DMX-h helps to scale within each node. Processing more data per second per node implies cost savings both for CAPEX and OPEX; you don’t have to increase your cluster size every time there is a significant jump in data volume. If your cluster is deployed in the cloud, this implies less usage and more cost savings.

In this particular test, we are setting a baseline with the standard TeraSort benchmark to demonstrate the benefits of using DMX-h Sort versus native sort. In Part II of this blog, we will focus on typical use cases for ETL, web log aggregation and Change Data Capture (CDC). Stay tuned…
