data volumes

When we announced our DMExpress Hadoop offering, we shared a set of results from benchmark testing that had been completed. Testing has continued since, and I wanted to dedicate this post to sharing some of those results.

We did a series of tests that distill down to:

  • TeraSort benchmark (if you’re not familiar with this benchmark, it is worthwhile to search it online)
  • Aggregation based on TPC-H generated data (aggregated on order id for line item data)

We varied two things in the tests:

  • Compression in the shuffle step: no compression, GZIP
  • Data volume: We ramped up to 4TB on the TeraSort and 600GB on the Aggregation

The tests were done on a 10-node cluster running CDH3u2 (Apache 0.20.2).

The results were very interesting, but not surprising.  For TeraSort:

  • No compression:
    • While DMExpress was faster for smaller data volumes (under 1TB), the elapsed times were still small – 15.12 minutes for native sort vs. 11.93 minutes with DMExpress for 500GB
    • When you pump up the data volumes, DMExpress really outperformed the native sort – 240.48 minutes for native sort vs. 144.18 minutes with DMExpress for 4TB. That’s a 40% reduction in elapsed time, or roughly 1.7x faster. The results were consistent for 1TB and 2TB as well
  • GZIP compression, where the results were consistently 2x or more faster:
    • 20.82 minutes for native vs. 8.98 minutes with DMExpress for 500GB
    • 223.82 minutes vs. 84.72 minutes with DMExpress for 4TB, more than 2x faster!
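
For anyone who wants to check the ratios, here is a minimal sketch that recomputes them from the elapsed times listed above. The function and variable names are my own for illustration; only the minute values come from the test results.

```python
# Elapsed times in minutes, copied from the TeraSort results above.
# Each value is (native sort, DMExpress).
terasort_minutes = {
    ("no compression", "500GB"): (15.12, 11.93),
    ("no compression", "4TB"): (240.48, 144.18),
    ("GZIP", "500GB"): (20.82, 8.98),
    ("GZIP", "4TB"): (223.82, 84.72),
}


def speedup(native, dmexpress):
    """How many times faster DMExpress ran than the native sort."""
    return native / dmexpress


def time_saved_pct(native, dmexpress):
    """Percentage reduction in elapsed time relative to the native sort."""
    return 100.0 * (native - dmexpress) / native


for (compression, volume), (native, dmx) in terasort_minutes.items():
    print(
        f"{compression:>14} {volume:>5}: "
        f"{speedup(native, dmx):.2f}x faster, "
        f"{time_saved_pct(native, dmx):.1f}% less elapsed time"
    )
```

Note that "x faster" and "percent less time" are two views of the same pair of numbers: a 40% reduction in elapsed time corresponds to roughly 1.67x, and a 2x speedup corresponds to a 50% reduction.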

For the Aggregation, we wrote the same aggregation logic in Java, Pig and DMExpress (a key benefit with DMExpress is using a GUI rather than coding, but this post is focused on performance). The compression results were consistent across the board with the non-compression results, so I will just give you the results using GZIP:

  • 150GB of data
    • Java: 2.4 minutes
    • Pig: 2.92 minutes
    • DMExpress: 1.18 minutes
  • 600GB of data
    • Java: 7.89 minutes
    • Pig: 11.15 minutes
    • DMExpress: 4.07 minutes

DMExpress is roughly 2x faster than Java, and consistently more than 2x faster than Pig.

What does that mean for you? It means that you can do more with fewer nodes, which has implications for both the CapEx and OpEx of your cluster. Simply stated, you can process more data with the cluster you already have available. If you happen to be running on a public cloud, faster processing times also mean less usage time.
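
To put a rough number on "less usage time," here is a back-of-the-envelope sketch. It assumes usage is billed per node for the full elapsed time of the job on the 10-node cluster used in the tests. The `price_per_node_hour` parameter is purely hypothetical; plug in whatever your provider charges.

```python
CLUSTER_NODES = 10  # cluster size used in the tests described in this post


def node_hours(elapsed_minutes, nodes=CLUSTER_NODES):
    """Total node-hours consumed by a job that occupies every node."""
    return nodes * elapsed_minutes / 60.0


def node_hours_saved(native_minutes, dmexpress_minutes, nodes=CLUSTER_NODES):
    """Node-hours freed up by the faster run."""
    return node_hours(native_minutes, nodes) - node_hours(dmexpress_minutes, nodes)


def usage_cost(hours, price_per_node_hour):
    """Cost of the given node-hours at a hypothetical hourly price."""
    return hours * price_per_node_hour


# 4TB TeraSort with GZIP: 223.82 minutes native vs. 84.72 minutes with DMExpress.
saved = node_hours_saved(223.82, 84.72)
print(f"Node-hours saved per run: {saved:.2f}")
```

This ignores startup overhead, billing granularity, and storage charges, so treat it as an order-of-magnitude estimate rather than a quote.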

If you have any questions or want to learn more, please feel free to leave a comment.


ESG’s Steve Duplessie has a great new blog post this week titled IT Chasms, Gaps and A New World Order. Featuring Steve’s classic, straight-shooting style, it is well worth your while to give it a read. It focuses mostly on networking (the kind with routers, not meeting people for a drink), but he makes a very interesting point about storage that I think is important and want to explore further.

After discussing how important it is for vendors to help customers develop applications faster, Duplessie says this:

The bigger truth is telling a storage buyer that your stuff is awesome because he can go faster running VMware is cool, but telling the App owner that your storage features will enable them to cut test and Q/A time by 30% is where the money is.

Hats off to that!  Steve is dead-on here. And one of the ways to do this – I would argue the best way – is by using your backup storage.

Let’s step back a bit.  Normally, when you hear vendors talking about using storage for test/dev tasks they start talking about snapshots and clones, and that usually means doing this with your primary storage.  Does it work? It does, but there’s a price to pay. 

First, primary storage is expensive, and using up high-speed disk resources for tasks that do not require high performance is spending money you’d rather not spend. Second, it impacts performance. Many disk array snapshots create quite a bit of impact on performance because the copy-on-write model means two writes and one read every time a block is written. To provide a hypothetical example, if “Barry the Unruly Developer” wants to do a lot of test/dev work off your primary disk, you risk serious impact to production performance.
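
To make the "two writes and one read" point concrete, here is a toy model of a copy-on-write snapshot that simply counts I/O operations. It is a deliberately simplified sketch, not a description of any particular array: the class and method names are invented for illustration, and in this model the extra read and write are paid the first time a block is overwritten after the snapshot is taken.

```python
class CopyOnWriteVolume:
    """Toy volume with a single copy-on-write snapshot and I/O counters."""

    def __init__(self, num_blocks):
        self.blocks = [0] * num_blocks
        self.snapshot_area = {}  # block index -> original data preserved for the snapshot
        self.snapshot_active = False
        self.reads = 0
        self.writes = 0

    def take_snapshot(self):
        """Start a new snapshot; nothing is copied until a block is overwritten."""
        self.snapshot_area = {}
        self.snapshot_active = True

    def write(self, index, value):
        """Overwrite a block, preserving the original first if a snapshot needs it."""
        if self.snapshot_active and index not in self.snapshot_area:
            original = self.blocks[index]         # read the original block
            self.reads += 1
            self.snapshot_area[index] = original  # write it to the snapshot area
            self.writes += 1
        self.blocks[index] = value                # write the new data in place
        self.writes += 1

    def read_snapshot(self, index):
        """Return the block as it looked when the snapshot was taken."""
        return self.snapshot_area.get(index, self.blocks[index])


volume = CopyOnWriteVolume(num_blocks=4)
volume.take_snapshot()
volume.write(0, 42)
print(f"reads={volume.reads}, writes={volume.writes}")
```

Multiply that overhead by a developer hammering away at a clone on the same spindles as production, and you can see where the performance hit comes from.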

If you happen to use NetApp for your primary storage, you happily avoid this performance penalty because not all snapshots are created equal. But what if you don’t have NetApp primary storage?

That’s where NetApp Syncsort Integrated Backup (NSB) can help.  NSB lets you back up from any primary storage environment to a NetApp FAS device.  When NSB captures data, it stores it using NetApp Snapshots. And guess what? You have full access to cloning capabilities. The benefits of this are many.

1.  Everything is running on secondary storage. That means low-cost SATA drives with loads of capacity.

2. Everything is running on secondary storage. That means that no matter how many clones you spin up, no matter how hard “Barry the Unruly Developer” bashes away at the system, the impact to your production environment is zero, as in none whatsoever!

3. Everything is running on secondary storage.  That means it’s all consolidated onto a single hardware platform, no matter what mix of primary disk you have. It even protects boot drive data that’s not on a SAN, so Barry has access to all the application information, not just the data volumes.

4. Everything can also run on tertiary storage. Just use SnapMirror replication to send your backups to a DR site, and you can do all your test/dev over there.

5. It’s all super easy. NSB overlays the NetApp Snapshot and FlexClone features with super-simple workflows. That means the person dishing out the storage to the test/dev folks doesn’t have to know how a NetApp FAS works. How many steps does it take to provision a 2 TB SQL database volume clone to a dev? A couple of mouse clicks. You can see how it’s done here.

6. It’s physical. It’s virtual. It’s virtu-physical! NSB can take any backup from any server and boot it from a FlexClone into a new VM in about ten minutes, start to finish. That’s right. When “Barry the Unruly Developer” demands a SQL Server instance to work on, you can say “Ten minutes, Barry!” And ten minutes later Barry has a brand new VM running that he can play with all he likes. All running off a FlexClone, using zero extra storage footprint. And running – did I mention this? – from secondary storage or even tertiary, if you’d rather have Barry as far away as possible! To see how this works, click here.

I could go on, but I think you get the idea. We have users doing this every day, leveraging their backup data for tasks beyond recovery: development, testing, data mining, reporting, even virus scanning. Anything you want to do that requires copies of your data and you would prefer to off-load from production hardware.

Saves time. Saves money. So easy that your most inexperienced IT person can be designated as “the guy that Barry gets his data from.” (And not to worry, inexperienced IT person – you can schedule NSB to deliver Barry his data every day, automatically.)

It makes you smile. It makes Barry smile. What’s not to love?


As a disclaimer, I should point out that I have been working with very large data for all of my working life and am extremely passionate about it. In fact, I was on the team that ran the first 1 terabyte, non-extrapolated ETL benchmark 10 years ago.

However, if I’m being completely honest, I must confess that all of this talk about Big Data (including from yours truly on the Syncsort blog) has me increasingly thinking that enough is enough. Suddenly, every company is now a Big Data company. It wouldn’t shock me to find a furniture company at the next tech industry tradeshow selling special reinforced Big Data storage cabinets!

Like kissing in the school yard, Big Data is the topic that everyone is talking about but very few are doing well (if at all). CEOs and CIOs everywhere are being bombarded with messaging that makes it sound like their businesses are about to grind to a halt if they don’t redirect significant portions of their budgets to this “new” area of focus.

As an aside, my first thought for the title of this post was “Has Big Data Jumped the Shark?” before I recalled a post by a similar name from Curt Monash.  In addition to being a very good post, I loved Merv Adrian’s quote towards the end about it being Crocodile Dundee’s job to determine what is and isn’t Big Data. That said, I have grown to quite like the title that I landed on for this post.

I must admit that I do love what Big Data has done for my social street cred. It used to be that data geeks like me, with our vampire tans from being in the data center all day (somehow made worse for a Brit like me), were mocked. Can you believe that? Now we are data scientists who are in high demand and can earn fortunes. I have even wondered if my experience in this space will one day lead to me being called a “Big Data professor.” But I digress and it is time to get back to business…

Recently, I met with two very smart (and very talented) executives looking for guidance on how to stop their company’s “impending destruction” at the hands of Big Data.  I naturally tried to share some pearls of wisdom, but what really struck me was that it took a simple name, “Big Data,” to make all this stuff sexy. Data didn’t just become Big Data overnight. One could argue that it has always been that way! Even before I was born, Syncsort was helping customers address the challenges of handling very large data volumes to save money.

So, I’m curious. Is it just me that’s thinking the term Big Data is starting to get so overhyped that it could eventually become meaningless? Is “Big Data” poised to be simply called “data” again? Leave me a comment with your thoughts.

Regardless, I love the fact that data and all the plumbing around it are finally sexy. If this keeps up, it will only be a matter of time before we will all be able to go to a spray tan shop and get the “data scientist special.”


I was on the Underground in London last week on my way back from visiting a financial services customer when I heard a couple of well-dressed gents carrying brollies (is it only us Brits that leave the house assuming it will rain no matter how nice it is outside?) musing over the old adage that the only things you can count on are taxes, death and trouble (as captured in this Marvin Gaye song).

Their conversation got me thinking that instead of trouble, there is actually another thing you can rely on today – that data is only going to get bigger. I would argue that the amount of useful information to be gleaned from this data is not growing at the same exponential rate. However, regardless of whether you consider your data ‘Big Data’ or not, you actually have to do a lot more “work” on your data as it grows to get business-relevant and valuable information from it.

A good example of this is close to the hearts of those of us impacted by the Eurozone (I’m intentionally avoiding the long debate as to whether the UK is actually a member given we’ve kept our own currency but are paying to support the euro). The worldwide financial crisis caused the rapid acceleration of new regulations and controls on markets and companies. In Europe, we already had Solvency II and Basel I, Basel II and now Basel III. These regulations are getting incredibly complex.

Calculations on “extreme” data volumes are required to remain compliant and keep senior executives from going to jail. In this case, picking the right ETL tool can be like receiving a “get out of jail free” card in Monopoly.

So why are the calculations required so complex? For starters, here in Europe we love them as evidenced by European Commission regulation (EC) 2257/94 which states – bananas must be “free from malformation of abnormal curvature.”  In the case of “extra class” bananas, there is no wiggle room but “class 1” bananas can have “slight defects of shape” while “class 2” bananas can have full-on “defects of shape.” Yes, that’s right. We have regulations about the shape and curvature of bananas and don’t even get me started on cucumbers (Commission Regulation (EEC) No 1677/88), where “class I” and “extra class” cucumbers are allowed a bend of 10mm per 10cm of length. Class II cucumbers can bend twice as much. So you can imagine how detailed our calculations must be for something like risk!

About 2 years ago, I was heavily involved with a very smart team working on industry models. To keep up with them, I decided I had to read and understand the Basel II regulations. All I will say is that whenever someone mentions they are working on a Basel project, it brings back horrible memories. I remember it being 4 a.m. on the first day of my “reading project” when I realised my brain hurt and that the scroll bar on the document didn’t look like it had moved. Tying this back to data integration, the point is that it’s definitely not just the volume of data that causes the problems for customers. More often than not, it’s the complexity of calculations or transforms they are dealing with.

Often when I’m speaking with people about data integration acceleration (a good example was the bank I visited earlier this week), they will respond that “our data isn’t really that big.” When pressed on how long it takes them to process their data and whether this satisfies the business, people usually pause and you can see the wheels turning in their heads. This is regularly followed by an admission that they are in fact exceeding their service level agreements. The next question is how much data growth they are seeing and whether they are prepared for it. After an even longer pause, I usually hear something like “we plan for 20 percent growth” (a commonly accepted average). However, I’ve heard numerous companies admit that actual data growth could range from 10 percent all the way up to 600 percent! But no one ever says their data isn’t growing. Inevitably, the conversation ends up focusing on how much time they spend tuning their existing environment, how much hardware they are buying, and how they have no better option than to push transformations into the database.
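
If you want to run the same thought experiment on your own environment, here is a rough sketch. It rests on two loud assumptions of mine: that processing time scales linearly with data volume, and that growth compounds once a year at a fixed rate. The function name and the 4-hour job and 8-hour window in the example are hypothetical, not figures from any customer.

```python
def years_until_sla_breach(current_hours, sla_hours, annual_growth):
    """Years until a job outgrows its processing window.

    Assumes runtime scales linearly with data volume and that volume
    grows by `annual_growth` (0.20 means 20 percent) every year.
    Returns 0 if the window is already exceeded, or None if the data
    is not growing and the window is therefore never exceeded.
    """
    if current_hours > sla_hours:
        return 0
    if annual_growth <= 0:
        return None
    years = 0
    runtime = current_hours
    while runtime <= sla_hours:
        runtime *= 1 + annual_growth  # one more year of compounded growth
        years += 1
    return years


# Hypothetical job: 4 hours today against an 8-hour window.
for growth in (0.10, 0.20, 6.00):  # 10, 20 and 600 percent per year
    years = years_until_sla_breach(4, 8, growth)
    print(f"{growth:.0%} growth: window exceeded after {years} year(s)")
```

The point is not the exact answer but how quickly the gap between planned growth and actual growth eats through whatever headroom you thought you had.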

It is always a bit amusing and always very satisfying when the same people who were saying they don’t have ‘Big Data’ are suddenly advocating for why data integration acceleration is needed and makes a lot of sense. Instead of reminding them of what they said in the first place, I simply smile and mention the amount of money they will be able to save from it, as well.

Perhaps I should revisit the title of my post. Three things that are guaranteed are death, taxes and data breaking your data integration infrastructure. If you are already using DMExpress, you can forget about the last one since we have you covered. Everyone else, you are invited to have one less thing to worry about. The whole death and taxes thing… we are sorry, but we can’t help there!
