Backups and Big Data

June 14, 2011

Over at InformationWeek, data protection guru George Crump has a new post, “Big Data a Big Backup Challenge,” that’s worth your time if you live in the Big Data world.

Big Data is near and dear to our hearts at Syncsort, where performance has been the core of our corporate DNA for over forty years. My colleague Jorge Lopez addressed it recently on the data integration side of our business, and I’m going to share a few thoughts from a data protection perspective, based on George Crump’s column.

George starts by noting that “Big Data is… a backup application’s worst nightmare because many Big Data environments consist of millions or even billions of small files. How do you design a backup infrastructure that will support the Big Data realities?”

Indeed, traditional backup has a huge problem with millions of files because traditional backup sees the world at the level of the file. That means that whenever a file changes, even a little, the whole file gets backed up again. Worse, the system has to continually scan through all those files to figure out what’s new. Even host-side deduplication products suffer here because they also have to contend with every file that changes.
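To make the scanning cost concrete, here is a minimal Python sketch of a file-level incremental scan. It is an illustration only, not how any particular backup product is implemented; the function name `scan_for_changes` and the use of modification times as the change test are assumptions made for the example.

```python
import os


def scan_for_changes(root, last_backup_time):
    """Walk every file under root and report those modified since last_backup_time.

    Hypothetical helper for illustration: real products may use archive bits,
    catalogs, or checksums instead of modification times.
    """
    scanned = 0
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            scanned += 1  # every file is examined, changed or not
            if os.stat(path).st_mtime > last_backup_time:
                changed.append(path)  # the whole file is queued for backup
    return scanned, changed
```

The point to notice is that `scanned` grows with the total number of files, not with the number of changes, so a tree with a billion small files pays for a billion `stat` calls even if only a handful changed.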

With NetApp Syncsort Integrated Backup (NSB), you avoid these problems because of our design strategy of “never read or move the same data block twice.” The key there is not so much “move” as “read.” When it comes to Big Data, NSB is like the hedgehog that knows one big thing. And that big thing is: do not scan the data!

NSB uses a change journal system, where updated blocks are tracked at a level below the file system, and they are tracked as they are written. This means we keep a constantly updated log of which blocks require backup, and this logging process creates so little system impact that it can hardly be measured. This is so much smarter than having to hash through the same data over and over again.
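The general idea of a block-level change journal can be sketched in a few lines of Python. This is a toy model under stated assumptions, not NSB's actual implementation: the class `JournaledVolume`, the 4 KB block size, and the in-memory dictionaries are all hypothetical stand-ins for what a real product would do below the file system.

```python
BLOCK_SIZE = 4096  # assumed block size, for illustration only


class JournaledVolume:
    """Toy volume that records which blocks change at the moment they are written."""

    def __init__(self):
        self.blocks = {}    # block number -> block contents (sparse)
        self.dirty = set()  # the change journal: blocks written since last backup

    def write(self, block_no, data):
        # Pad or truncate to one block, store it, and log the block number.
        self.blocks[block_no] = data.ljust(BLOCK_SIZE, b"\0")[:BLOCK_SIZE]
        self.dirty.add(block_no)

    def incremental_backup(self, target):
        # Copy only the journaled blocks; nothing else is read or scanned.
        changed = sorted(self.dirty)
        for block_no in changed:
            target[block_no] = self.blocks[block_no]
        self.dirty.clear()
        return changed


vol = JournaledVolume()
vol.write(7, b"first")
vol.write(42, b"second")
vol.write(7, b"first, updated")  # same block written twice, journaled once

backup = {}
copied = vol.incremental_backup(backup)
```

In this sketch the work done at backup time depends only on the number of blocks that changed, regardless of how many files or blocks the volume holds, and a block written several times between backups is still copied just once.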

George continues by making a lot of sensible recommendations on how to deal with Big Data, including using both disk and tape (each serves its purpose), having proper dedupe and compression, and, perhaps most importantly, identifying what really needs to be protected and what can be re-created if needed.

Big Data is here to stay. While not everyone has Big Data issues today, ask those who are suffering and I guarantee you they will recommend getting out in front of the problem before it is too late.
