I recently watched a video blog from ESG Senior Analyst Lauren Whitehouse, which she posted from SNW in Orlando. In the video, Lauren discusses a presentation she made on deduplication and makes several interesting observations that really got me thinking.
First, she was surprised at how many people were in attendance, given that deduplication is, to an extent, yesterday's news. Plus, nearly everyone in the audience was already using dedupe technology. So why were they there?
“People are curious about what’s next,” Lauren says. And the discussion has changed. I had to grin when Lauren said, “Just a couple of years ago… the big discussion was about in-line vs. post-process,” because I remember those arguments really well — I was making them myself (with a previous company).
Those were the days when Data Domain was staking out the in-line turf with an argument that was basically "in-line is the only way, the rest of you are stupid." Then they got bought by EMC and became, according to some, a little arrogant.
Now that the dust has settled, most vendors offer a mix of modes and the real answer to the argument is “it depends,” which is exactly what the smarter observers were saying at the time (I recall Curtis Preston being one of them).
In any case, the world has moved on and the discussion along with it. Lauren notes that there are so many options now, so many different places to deploy dedupe. It's not just disk targets anymore. As she puts it: "When you look at the number of solutions that are available: hardware, software, primary storage, backup storage, to the cloud, there's just so many different things that have to be evaluated. It's confusing."
Confusing it is. And that's one reason we've been focused on simplicity and completeness with NetApp Syncsort Integrated Backup (NSB). It may seem a bit of a paradox – don't you get less complete as you get simpler? Not necessarily. Completeness is about having all the things you really need, while simplicity is about making them easy to use and as transparent as possible. Ideally, you strike a balance between the two.
With NSB, we've taken the notion of data reduction and built it into the backup process from end to end. Note that I say "data reduction." Deduplication is a specific technology approach; data reduction is the goal. NSB starts the process at the server with a block-level backup method that's designed to never read and copy the same data twice (we use our own technology with our Agents, and leverage VMware Changed Block Tracking for Agentless backups). This gets you the data reduction without the impact of deduplication, which relies on reading all your data, crunching a pile of hashes and comparing them, and then doing it all again the next time you back up. Dedupe at the server doesn't make sense: you're stealing resources and creating impact in the last place you want to.
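To make that contrast concrete, here's a minimal sketch in Python. It is not NSB's actual implementation, just an illustration of the two patterns: hash-based dedupe has to read and fingerprint every block on every pass, while a changed-block approach only touches the blocks a tracker (such as VMware CBT) has flagged since the last backup. Block size and data structures are assumptions for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative fixed block size; real products vary


def dedupe_backup(volume: bytes, seen_hashes: set) -> list:
    """Hash-based dedupe: read every block, hash it, compare,
    and store only blocks whose fingerprints are new."""
    stored = []
    for offset in range(0, len(volume), BLOCK_SIZE):
        block = volume[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in seen_hashes:   # unseen data, keep it
            seen_hashes.add(digest)
            stored.append((offset, block))
    return stored  # note: the full volume was read and hashed regardless


def changed_block_backup(volume: bytes, changed_offsets: set) -> list:
    """Changed-block approach: a tracker reports which block offsets
    were written since the last backup, so only those are read and
    copied; there is no full scan and no hashing pass."""
    stored = []
    for offset in sorted(changed_offsets):
        stored.append((offset, volume[offset:offset + BLOCK_SIZE]))
    return stored
```

The point of the sketch is simply where the work lands: the first function burns CPU and I/O on the production server every time it runs; the second copies only what changed and leaves any remaining reduction to the target.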
Target dedupe is much more viable because you've got hardware designed to handle the load. NSB puts deduplication where it belongs: at the disk target. That said, since NSB creates little duplicate data in the first place, you won't see very high dedupe rates there; most of the work is already done. The reason you see 95%-plus dedupe rates from other approaches is that you dump so much duplicate data into your target only to get rid of it. What a waste of effort. You can learn a lot more about this topic in our Beyond Deduplication white paper if you're interested.
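A quick back-of-the-envelope illustration, using hypothetical numbers rather than measurements, shows why a lower target-side dedupe ratio isn't a worse result:

```python
def dedupe_ratio(ingested_gb: float, stored_gb: float) -> float:
    """Ratio of data written to the target vs. data actually kept."""
    return ingested_gb / stored_gb


# Approach A: dump everything, dedupe at the target.
# 10 TB lands on the appliance and 0.5 TB survives, a 20:1 ratio
# (95% reduction), but all 10 TB still had to be moved and hashed.
print(dedupe_ratio(10_000, 500))  # 20.0

# Approach B: source-side change tracking sends only 1 TB of new blocks;
# target dedupe trims that to 0.5 TB, so the target only sees 2:1,
# yet the end result (0.5 TB stored for the same data) is identical.
print(dedupe_ratio(1_000, 500))   # 2.0
```

The headline dedupe ratio measures how much redundant data you shipped to the target, not how efficient the overall backup was.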
So does dedupe still matter? Of course it does. Data reduction is what makes disk-to-disk backup economically feasible, and deduplication is part of the data reduction process. Not all of it, necessarily, but part of it.
Lauren Whitehouse makes the point that in the last few years data reduction has moved down the list of projected IT spending priorities, but it's still solidly in the top ten initiatives. It's becoming a standard part of IT. As she notes, "Data growth is not stopping. It's a continuous pain point for everyone. So I think it's going to continue to be a high priority just in the face of how do I deal with all this data?"
We feel the best way to deal with data growth is comprehensively, from the server to the target. Squeeze out efficiencies wherever you can, in a way that minimizes impact and resource consumption.