Data Integration

When we announced our DMExpress Hadoop offering, we shared a set of results from benchmark testing that had been completed. Testing has continued since, and I wanted to dedicate this post to sharing some of those results.

We did a series of tests that distill down to:

  • TeraSort benchmark (if you’re not familiar with it, it is worth looking up online)
  • Aggregation based on TPC-H generated data (aggregated on order id for line item data)
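For readers who want a concrete picture of the aggregation test, here is a minimal Python sketch of grouping line item rows by order id. It is not the code we benchmarked. It assumes the pipe-delimited layout that the TPC-H data generator produces for the lineitem table (order key in the first column, extended price in the sixth), and the choice to sum the extended price is my own assumption for illustration. The sample rows are made up and truncated.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal


def aggregate_by_order(lines):
    """Sum the extended price per order key from pipe-delimited lineitem rows."""
    totals = defaultdict(Decimal)  # missing keys start at Decimal('0')
    for row in csv.reader(lines, delimiter="|"):
        if not row:
            continue  # skip blank lines
        order_key = int(row[0])               # l_orderkey
        totals[order_key] += Decimal(row[5])  # l_extendedprice (assumed measure)
    return dict(totals)


# Illustrative, truncated rows; real lineitem rows carry more columns.
sample = io.StringIO(
    "1|155190|7706|1|17|21168.23|0.04|0.02|\n"
    "1|67310|7311|2|36|45983.16|0.09|0.06|\n"
    "2|106170|1191|1|38|44694.46|0.00|0.05|\n"
)
totals = aggregate_by_order(sample)
for order_key, total in sorted(totals.items()):
    print(order_key, total)
```

On a cluster the same grouping happens in the shuffle step, which is why the compression setting matters for this test.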

We varied two things in the tests:

  • Compression in the shuffle step: no compression, GZIP
  • Data volume: We ramped up to 4TB on the TeraSort and 600GB on the Aggregation

The tests were done on a 10-node cluster running CDH3u2 (based on Apache Hadoop 0.20.2).

The results were very interesting, but not surprising.  For TeraSort:

  • No compression:
    • While DMExpress was faster for smaller data volumes (under 1TB), the elapsed times were still small – 15.12 minutes for native sort vs. 11.93 minutes with DMExpress for 500GB
    • When you pump up the data volumes, DMExpress really outperformed the native sort – 240.48 minutes for native sort vs. 144.18 minutes with DMExpress for 4TB. That’s a 40% reduction in elapsed time, or about 1.7x faster. That was consistent for 1TB and 2TB, as well
  • GZIP compression: the results were consistently 2x or more faster:
    • 20.82 minutes for native vs. 8.98 minutes with DMExpress for 500GB
    • 223.82 minutes vs. 84.72 minutes with DMExpress for 4TB, more than 2x faster!
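Percent reduction and "x times faster" are two views of the same pair of timings, and they are easy to mix up. Here is a small Python sketch that converts the TeraSort timings above into both. The function names are my own, and the timings are simply the ones quoted in this post.

```python
def speedup(baseline_min, new_min):
    """How many times faster the new run is than the baseline."""
    return baseline_min / new_min


def pct_reduction(baseline_min, new_min):
    """Percentage of the baseline elapsed time that was eliminated."""
    return 100.0 * (baseline_min - new_min) / baseline_min


# (native sort minutes, DMExpress minutes) as reported above
terasort = {
    "500GB, no compression": (15.12, 11.93),
    "4TB, no compression": (240.48, 144.18),
    "500GB, GZIP": (20.82, 8.98),
    "4TB, GZIP": (223.82, 84.72),
}

for label, (native, dmx) in terasort.items():
    print(
        f"{label}: {speedup(native, dmx):.2f}x faster, "
        f"{pct_reduction(native, dmx):.1f}% less elapsed time"
    )
```

A useful rule of thumb: a 50% reduction in elapsed time is exactly 2x faster, so anything under 50% is less than 2x.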

For the Aggregation, we wrote the same aggregation logic in Java, Pig and DMExpress (a key benefit with DMExpress is using a GUI rather than coding, but this post is focused on performance). The compression results were consistent across the board with the non-compression results, so I will just give you the results using GZIP:

  • 150GB of data
    • Java: 2.4 minutes
    • Pig: 2.92 minutes
    • DMExpress: 1.18 minutes
  • 600GB of data
    • Java: 7.89 minutes
    • Pig: 11.15 minutes
    • DMExpress: 4.07 minutes

DMExpress is roughly 2x faster than Java, and consistently more than 2x faster than Pig.

What does that mean for you? It means that you can do more with fewer nodes, which has implications for both the CapEx and OpEx of your cluster. Simply stated, you can process more data with the cluster you already have available. If you happen to be running on a public cloud, faster processing times also mean less usage time.
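To make the usage-time point concrete, here is a rough Python sketch that turns elapsed minutes into node-hours for the 4TB GZIP TeraSort run on our 10-node cluster. The hourly rate is purely hypothetical (I made it up for illustration), so treat the dollar figure as a placeholder and plug in your own numbers.

```python
def node_hours(nodes, elapsed_min):
    """Total node-hours consumed by a job that occupies every node."""
    return nodes * elapsed_min / 60.0


NODES = 10                         # cluster size used in the tests above
HYPOTHETICAL_RATE_PER_HOUR = 0.50  # assumed price per node-hour, illustration only

native = node_hours(NODES, 223.82)  # native sort, 4TB, GZIP
dmx = node_hours(NODES, 84.72)      # DMExpress, 4TB, GZIP
saved = native - dmx

print(f"native sort: {native:.2f} node-hours")
print(f"DMExpress:   {dmx:.2f} node-hours")
print(f"saved:       {saved:.2f} node-hours "
      f"(about ${saved * HYPOTHETICAL_RATE_PER_HOUR:.2f} at the assumed rate)")
```

This assumes the job has the whole cluster to itself for its full duration, which is a simplification.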

If you have any questions or want to learn more, please feel free to leave a comment.


Earlier this year, I kicked off the “proof is in the pudding” blog series as a way to share results that DMExpress is achieving during proofs of concept (POCs) in real customer environments. The idea is to wow the loyal readers of the Syncsort blog with information about DMExpress’ speed, efficiency and ease of use.

It has been too long since I contributed to the series, but I promise to start posting more frequently. We’ve got a lot of exciting work going on behind the scenes and plenty of information to share.

For this post, I want to focus on a recent POC involving a customer that was running up against their nightly batch window. If there was any failure at all during the evening, the customer would not be able to refresh the data warehouse, leaving business users with data that was 24+ hours old. This was simply not acceptable to the business, and we knew that DMExpress was just the right solution for the job.

For this POC, the environment consisted of a four-core UNIX box with ETL coded in PL/SQL (while that’s really ELT, please forgive the semantics for right now). Another challenge this customer had was that this particular ETL flow involved nearly 900 lines of PL/SQL, which was incredibly complex and nearly impossible to maintain. In fact, they really only had one person capable of maintaining it. What happens if he goes away? Hopefully this doesn’t sound too familiar to you!

The stated goal of the POC was to reduce elapsed processing time by 33%. Additionally, we were looking to demonstrate that DMExpress could significantly reduce the complexity of building and maintaining the ETL.

The particular job involved 5 data sources, identifying changed records, performing multiple joins, enhancing the information via lookup, and loading the database.  The POC ran on approximately 350,000 records, a relatively small amount of data. However, as you are about to find out, the results were quite impressive!

The original process was taking 90 minutes, so the 33% reduction that the POC targeted meant that we had to reduce it to 60 minutes. How did DMExpress do? How about only 6 minutes! That’s a 15x improvement in throughput and 93% reduction in elapsed time for those of you keeping score at home.
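For those keeping score at home, here is the scorekeeping as a short Python sketch. The helper name is my own, and I am reading the 33% goal as one third, which is what gets you from 90 minutes to 60.

```python
from fractions import Fraction


def target_elapsed(baseline_min, reduction):
    """Elapsed time allowed after cutting the baseline by the given fraction."""
    return baseline_min * (1 - reduction)


BASELINE_MIN = 90  # original PL/SQL process
ACTUAL_MIN = 6     # DMExpress result

target = target_elapsed(BASELINE_MIN, Fraction(1, 3))   # goal: one third off
throughput_gain = Fraction(BASELINE_MIN, ACTUAL_MIN)    # same work, less time
reduction_pct = 100 * (BASELINE_MIN - ACTUAL_MIN) / BASELINE_MIN

print(f"target: {float(target):.0f} min, actual: {ACTUAL_MIN} min")
print(f"throughput: {float(throughput_gain):.0f}x, "
      f"elapsed time down {reduction_pct:.0f}%")
```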

How about the 900 lines of PL/SQL? We took that and converted it into just 2 DMExpress jobs, now built and able to be maintained in a simple, easy-to-use graphical user interface. Needless to say, the customer was impressed.

Stay tuned for more results in the days and weeks ahead. In the meantime, don’t be shy about posting comments and questions.

We are also still open to taking on any challengers willing to put their solutions up head-to-head versus DMExpress in a benchmark. Of course, with results like the ones I’ve shared above, I guess it’s not a big surprise that we haven’t had any takers on that just yet…


Earlier this week, I spent three great days attending Gartner’s Business Intelligence (BI) Summit in Los Angeles. All the usual suspects presented on a variety of topics ranging from Big Data to cloud computing to mobile. However, this year felt a bit different to me. There seemed to be a realization that BI (and related) technologies alone do not represent the road to perfection and information nirvana.

Instead, what I observed was organizations being careful about how they leverage the hot trends of the day. They seem to recognize that they must carefully watch the technology evolution, understand the associated risks and opportunities, and only then determine how to incorporate them into plans for supporting the business.

Here are some personal takeaways from the conference, in no particular order:

  • Big Data means big noise.  As companies start to analyze Big Data, the amount of noise grows exponentially. In fact, some studies estimate the noise level to be greater than 70%. Therefore, the challenge becomes how to efficiently and effectively process all the data while filtering out the noise. As Gartner analysts mentioned, organizations need to be very careful not to add bad data in their quest to leverage Big Data. I agree.
  • Information is about connecting the dots.  Once we’ve filtered out the noise, we have to connect the dots. Raw data by itself has marginal value. Connecting the dots enables us to convert data into information, adding a tremendous amount of value along the way. For instance, having comprehensive data about suspected terrorists has nearly no value if we can’t intelligently connect the dots to unmask their network and predict the next move.  Data Integration plays a key role as the first line of defense, not only to integrate the myriad sources of information, but also to do so in a timely fashion.
  • The decision environment has evolved. Instead of only the strategic aspect, the decision environment now also includes the management and operational aspects. This results in new requirements in terms of velocity, variety and volumes of data. For instance, operational workers need near real-time data at the lowest level of detail while strategists may look at weekly, monthly, even yearly trends of aggregated data. No wonder Gartner predicts that by 2014, most organizations will not scale to meet the requirements of Big Data! Now think about what happens if you have underperforming data integration tools pushing transformations down to the database. Performance clearly has a huge impact across the entire organization.  
  • Balancing resources is a daunting, but critical task. Fortunately, this is not the case for Syncsort. At the event, Mark Beyer did a great job highlighting this challenge. In the era of Big Data, it’s more important than ever to balance resource utilization – that is CPU, memory, storage and I/O. Workloads are competing for all these resources and the variables are not static. Organizations have different workloads in different months, weeks, and days of the year. IT organizations are being forced to either “oversize” their systems or endure constant tuning/optimization cycles while living in fear of failure at “rush hour.” This is why having a highly scalable, self-tuning engine like DMExpress is so powerful.
  • Volume is (not) a “20 mules problem.” This is a funny yet interesting analogy. Basically, it means that you could just throw more cores (mules) at a given data processing job to parallelize it and get it done. Of course, it’s not that simple. I would argue that with more mules also come more issues such as what to feed the mules, how to house them, clean them, keep them healthy, etc. Therefore, you might as well keep the number of mules (or cores) to a minimum! Again, this is another area where DMExpress is highly differentiated in the marketplace.

I’ve been working on shortening my blog posts. It seems that “attention span thing” always gets in the way of everything I want to share! However, if you’ve made it this far, surely you are wondering what Skynet has to do with any of this.

Well, for the first time at a BI conference (and I’ve been to many of them), I observed a subtle, but legitimate concern among attendees about how information – and more specifically algorithms and automated decision making – is shaping our lives and culture. To paraphrase Kevin Slavin, algorithms are shaping the way we live, what we read, what we write, what we consume.

Don’t believe me? Think about the last time you bought a product on Amazon, selected a movie from Netflix, or found a business through Google search.

What do you think? If you attended the Gartner BI Summit, what were your takeaways? How is Big Data impacting you? Let’s keep the discussion going.


As a disclaimer, I should point out that I have been working with very large data for all of my working life and am extremely passionate about it. In fact, I was on the team that ran the first 1 terabyte, non-extrapolated ETL benchmark 10 years ago.

However, if I’m being completely honest, I must confess that all of this talk about Big Data (including from yours truly on the Syncsort blog) has me increasingly thinking that enough is enough. Suddenly, every company is now a Big Data company. It wouldn’t shock me to find a furniture company at the next tech industry tradeshow selling special reinforced Big Data storage cabinets!

Like kissing in the school yard, Big Data is the topic that everyone is talking about but very few are doing well (if at all). CEOs and CIOs everywhere are being bombarded with messaging that makes it sound like their businesses are about to grind to a halt if they don’t redirect significant portions of their budgets to this “new” area of focus.

As an aside, my first thought for the title of this post was “Has Big Data Jumped the Shark?” before I recalled a post by a similar name from Curt Monash.  In addition to being a very good post, I loved Merv Adrian’s quote towards the end about it being Crocodile Dundee’s job to determine what is and isn’t Big Data. That said, I have grown to quite like the title that I landed on for this post.

I must admit that I do love what Big Data has done for my social street cred. It used to be that data geeks like me, with our vampire tans from being in the data center all day (somehow made worse for a Brit like me), were mocked. Can you believe that? Now we are data scientists who are in high demand and can earn fortunes. I have even wondered if my experience in this space will one day lead to me being called a “Big Data professor.” But I digress and it is time to get back to business…

Recently, I met with two very smart (and very talented) executives looking for guidance on how to stop their company’s “impending destruction” at the hands of Big Data.  I naturally tried to share some pearls of wisdom, but what really struck me was that it took a simple name, “Big Data,” to make all this stuff sexy. Data didn’t just become Big Data overnight. One could argue that it has always been that way! Even before I was born, Syncsort was helping customers address the challenges of handling very large data volumes to save money.

So, I’m curious. Is it just me that’s thinking the term Big Data is starting to get so overhyped that it could eventually become meaningless? Is “Big Data” poised to be simply called “data” again? Leave me a comment with your thoughts.

Regardless, I love the fact that data and all the plumbing around it are finally sexy. If this keeps up, it will only be a matter of time before we will all be able to go to a spray tan shop and get the “data scientist special.”
