May 2012

Innovation, and what succeeds in the market, is an endlessly interesting topic. I was reminded of this recently when I read a New Yorker magazine profile of Clayton Christensen, the business guru most famous for his book “The Innovator’s Dilemma.” The profile extends beyond his work: it covers his family background, his battle with cancer, his religious faith, and more. In all, it is a fascinating and inspiring profile that I highly recommend. At the moment, it’s behind a subscription wall, so if you have access you can get it here, or you can read it in the May 14, 2012, print edition.

Christensen’s notion of “disruptive innovation” applies across industries. An interesting example is perhaps his most famous “miss”: the iPhone, which he predicted would not succeed because it was just a fancy cell phone. What he realized later, after its phenomenal success, was that the iPhone was actually disruptive to laptops, not just to other cell phones. A great insight, albeit after the fact.

All of this got me thinking about changes in the backup world in the past few years, particularly two disruptive technologies, deduplication and snapshots.

Deduplication first made its mark in the form of deduplication appliances, single-purpose devices that were highly disruptive to tape as a backup target. Disk had long been used for backup, whether as plain disk or in the form of a VTL, but it remained a niche approach because it was just too expensive. As a result, disk was limited to only a day or two of data retention, if used at all. Deduplication radically changed the economics by providing data reduction rates of 90% or more, which is another way of saying you could get ten times as much use out of the same amount of disk at 90%, and twenty times at 95%.
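As a quick check on that arithmetic, here is a minimal Python sketch. It assumes a reduction rate r means only a fraction 1 - r of the original data lands on disk, so the same disk holds 1 / (1 - r) times as much backup data. The function name is illustrative only and does not come from any product.

```python
def capacity_multiplier(reduction_rate: float) -> float:
    """Return how many times more backup data fits on the same disk.

    A reduction rate of r means only (1 - r) of the original bytes are
    stored, so the effective capacity grows by 1 / (1 - r).
    """
    if not 0 <= reduction_rate < 1:
        raise ValueError("reduction_rate must be in [0, 1)")
    return 1.0 / (1.0 - reduction_rate)


if __name__ == "__main__":
    for rate in (0.50, 0.90, 0.95):
        multiplier = capacity_multiplier(rate)
        print(f"{rate:.0%} reduction -> about {multiplier:.0f}x effective capacity")
```

The jump from 90% to 95% doubles the multiplier, which is why small differences in reduction rate matter so much for retention.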

It changed the face of backup as far as tape was concerned, but interestingly, deduplication was not disruptive to the backup process. Users started replacing tape drives with disk, but everything else stayed the same. In the end, deduplication appliances were disruptive to only a portion of the backup process at the very end of the line. They were evolutionary, not revolutionary.

Snapshots have the potential to be truly revolutionary because they disrupt the entire traditional backup process, changing it end to end, not just at the final step in the chain. But even though snapshots have been around for a long time, they are still not the leading way to protect data, despite their speed advantages. A survey by UBM TechWeb (commissioned by Syncsort) showed only 25% of users made use of primary storage snapshots (you can get the full survey here).

Why the limited uptake? A few key reasons: 

  • Cost: snapshots are typically done on primary disk, which is expensive.
  • Performance: many disk arrays suffer significant performance degradation as snapshots accumulate.
  • Complexity of restore: snapshots are great at capturing data, but a lot of disk systems do not have convenient, easy-to-use workflows for recovering data, do not have a catalog, etc.
  • Limited retention time: because they are expensive, you normally can’t keep weeks or months of data on snapshots.

Maybe this is why snapshots haven’t been as disruptive to traditional backup as might have been expected. So are snapshots destined to remain a limited-use option, typically relegated to tier-1 applications and short retention times?

Not at all! There’s a disruptive technology in town now, and it’s called NetApp Syncsort Integrated Backup (NSB). How does NSB change things? It is quite simple. NSB takes the snapshots off the primary storage and puts them onto secondary storage, and then overlays them with easy recovery workflows and a catalog. This seemingly simple change in the design addresses all of the key reasons listed above for limited uptake.

I’ve written about this before here if you’re interested in more specifics.

For now, I will conclude with a concept from Clayton Christensen, who describes consumer product selection as people looking for a way to get “jobs to be done.” Simply put, people don’t want products, they want to get something accomplished. The IT world is no different. None of us want backup software, really. What we want is for data to be protected and easily recoverable in a way that is cost-effective and reliable, and doesn’t demand too much of our attention. This is exactly what NSB delivers, as we heard recently from a user. It can do the same for you.


When we announced our DMExpress Hadoop offering, we shared a set of results from benchmark testing that had been completed. Testing has continued since, and I wanted to dedicate this post to sharing some of those results.

We did a series of tests that distill down to:

  • TeraSort benchmark (if you’re not familiar with this benchmark, it is worthwhile to search for it online)
  • Aggregation based on TPC-H generated data (aggregated on order id for line item data)
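To make the aggregation workload concrete, here is a small Python sketch of a group-by on order id over line item rows. The column names follow the TPC-H lineitem table (l_orderkey, l_quantity, l_extendedprice), but the sample rows are made up, and the specific aggregates (line count, total quantity, total extended price) are assumptions, since the exact aggregates used in the tests are not listed here. This is not the DMExpress, Java, or Pig code that was benchmarked.

```python
from collections import defaultdict

# Hypothetical sample rows: (l_orderkey, l_quantity, l_extendedprice)
LINE_ITEMS = [
    (1, 17, 21168.23),
    (1, 36, 45983.16),
    (2, 38, 44694.46),
    (3, 45, 54058.05),
    (3, 49, 46796.47),
]


def aggregate_by_order(rows):
    """Group line items by order key and accumulate simple totals."""
    totals = defaultdict(lambda: {"line_count": 0, "quantity": 0, "revenue": 0.0})
    for order_key, quantity, extended_price in rows:
        entry = totals[order_key]
        entry["line_count"] += 1
        entry["quantity"] += quantity
        entry["revenue"] += extended_price
    return dict(totals)


if __name__ == "__main__":
    for order_key, entry in sorted(aggregate_by_order(LINE_ITEMS).items()):
        print(order_key, entry)
```

In a MapReduce job the same logic is split in two: the map step emits each row keyed by order id, and the reduce step performs the accumulation after the shuffle, which is where the compression setting below comes into play.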

We varied two things in the tests:

  • Compression in the shuffle step: no compression, GZIP
  • Data volume: We ramped up to 4TB on the TeraSort and 600GB on the Aggregation

The tests were done on a 10-node cluster running CDH3u2 (based on Apache Hadoop 0.20.2).

The results were very interesting, but not surprising.  For TeraSort:

  • No compression:
    • While DMExpress was faster for smaller data volumes (under 1TB), the elapsed times were still small – 15.12 minutes for native sort vs. 11.93 minutes with DMExpress for 500GB
    • At larger data volumes, DMExpress really outperformed the native sort – 240.48 minutes for native sort vs. 144.18 minutes with DMExpress for 4TB. That’s a 40% reduction in elapsed time, or about 1.7x faster. The results were consistent for 1TB and 2TB as well
  • GZIP compression (the results were consistently 2x or more faster):
    • 20.82 minutes for native vs. 8.98 minutes with DMExpress for 500GB
    • 223.82 minutes for native vs. 84.72 minutes with DMExpress for 4TB, more than 2.5x faster!
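The two ways of stating an improvement (percent of elapsed time saved vs. an "x times faster" ratio) are easy to mix up, so here is a short Python sketch that derives both from the TeraSort timings listed above. The helper names are illustrative only; the timings are the ones reported in this post.

```python
def speedup(baseline_minutes: float, new_minutes: float) -> float:
    """How many times faster the new run is (baseline time / new time)."""
    return baseline_minutes / new_minutes


def time_saved(baseline_minutes: float, new_minutes: float) -> float:
    """Fraction of the baseline elapsed time that was eliminated."""
    return (baseline_minutes - new_minutes) / baseline_minutes


# (compression, data volume) -> (native sort minutes, DMExpress minutes)
TERASORT_RESULTS = {
    ("none", "500GB"): (15.12, 11.93),
    ("none", "4TB"): (240.48, 144.18),
    ("gzip", "500GB"): (20.82, 8.98),
    ("gzip", "4TB"): (223.82, 84.72),
}

if __name__ == "__main__":
    for (compression, volume), (native, dmx) in TERASORT_RESULTS.items():
        print(
            f"{compression:>4} {volume:>5}: "
            f"{speedup(native, dmx):.2f}x faster, "
            f"{time_saved(native, dmx):.0%} less elapsed time"
        )
```

Note that a 40% reduction in elapsed time corresponds to a ratio of roughly 1.67x, while a 2x ratio requires cutting elapsed time in half.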

For the Aggregation, we wrote the same aggregation logic in Java, Pig and DMExpress (a key benefit with DMExpress is using a GUI rather than coding, but this post is focused on performance). The compression results were consistent across the board with the non-compression results, so I will just give you the results using GZIP:

  • 150GB of data
    • Java: 2.4 minutes
    • Pig: 2.92 minutes
    • DMExpress: 1.18 minutes
  • 600GB of data
    • Java: 7.89 minutes
    • Pig: 11.15 minutes
    • DMExpress: 4.07 minutes

DMExpress is roughly 2x faster than Java, and consistently more than 2x faster than Pig.
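For readers who want to see where those ratios come from, this Python sketch divides each Java and Pig elapsed time by the corresponding DMExpress time, using the GZIP aggregation timings listed above. The dictionary layout and function name are just one convenient way to organize the numbers.

```python
# Elapsed minutes for the GZIP aggregation runs reported in this post.
AGGREGATION_MINUTES = {
    "150GB": {"Java": 2.4, "Pig": 2.92, "DMExpress": 1.18},
    "600GB": {"Java": 7.89, "Pig": 11.15, "DMExpress": 4.07},
}


def ratios_vs_dmexpress(results):
    """For each data volume, return elapsed time relative to DMExpress."""
    ratios = {}
    for volume, timings in results.items():
        dmx = timings["DMExpress"]
        ratios[volume] = {
            tool: minutes / dmx
            for tool, minutes in timings.items()
            if tool != "DMExpress"
        }
    return ratios


if __name__ == "__main__":
    for volume, per_tool in ratios_vs_dmexpress(AGGREGATION_MINUTES).items():
        for tool, ratio in per_tool.items():
            print(f"{volume}: DMExpress is {ratio:.2f}x faster than {tool}")
```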

What does that mean for you? It means that you can do more with fewer nodes, which has implications for the CapEx and OpEx associated with your cluster. Simply stated, you can process more data with the cluster you already have available. If you happen to be running on a public cloud, faster processing times also mean less usage time.

If you have any questions or want to learn more, please feel free to leave a comment.
