March 2011

I am the first to admit that there are some really bad webcasts that IT vendors have created over the years (never any I have been involved with, of course!). However, every so often I come across one that has the right mix of educational material and practical advice to make it worthy of the time investment to watch. Now that I’m a blogger, I have a forum to endorse such webcasts!

Our friends at NetApp recently sponsored a webcast with Gartner on “Next Generation Backup and Recovery for Virtual Environments.” Unlike most webcasts, which jump immediately to the sales pitch, this one starts off with a great overview from Gartner analyst Dave Russell on the broken state of the backup market today. Dave does a really effective job of exploring some of the key drivers of change (data growth, tougher SLAs, virtualization, etc.) that are preventing many organizations from meeting the recovery expectations of the business. If only for Dave Russell’s portion, the webcast is worth tuning into (did I mention it is on demand too, so you can watch it whenever you want?).

While admittedly a bit biased (I do make my living spreading the word about NSB, after all), I also found the two other segments of the webcast to be really well done. In the middle segment, Syncsort VP of Channels Mike Kuehn and NetApp Director of Data Protection Solutions Mark Welke talk about how legacy backup architectures are not keeping pace with the needs of today’s data centers. They also plug NSB (tastefully done, may I add) and talk about how it goes beyond deduplication and storage reduction to solve a wider range of issues for customers across physical and virtual servers. All good stuff, I promise!

The webcast really saves the best for last, though. To me, there is little more powerful than when a customer opens up and talks about their real-world experience with a product or solution (curious that it doesn’t always align with what is in a vendor’s marketing materials, don’t you think?). Bob Scott, the IT Manager for Community Health Center of Snohomish County (Washington), talks about how they had outgrown their existing solution and needed something that could improve the consistency and usability of their backups while reducing the administrative time required. NSB was just what the doctor ordered (pun intended): the ease of use of the management console enabled a quick ramp-up and freed up time for more strategic initiatives in other production environments. Music to my ears, though, was hearing Bob speak at the end about how NSB is helping the CHC serve the community better by giving doctors better tools to give patients better care. Now that is what technology should be about!

{ 0 comments }

Last week I was reading a discussion thread that somehow resurrected an old debate about ETL vs. ELT. In other words, should the “T” live in the database or in the “ETL” tool? Several arguments were made in favor of pushing the “T” down to the database. Later that week, I got a similar question from the audience during a TDWI Webinar with Claudia Imhoff:

Databases are better at set-based operations than record-at-a-time ETL tools. Therefore, why not use the database for ETL?

To better explain this, I called my colleagues Dave Nahmias and Mike Wilkes to the rescue, so I would like to share some of their thoughts in this post.

Yes, SQL is based on set theory. Its original goal was to limit the result set, allowing users to retrieve specific information from the database. “Prior to relational databases, users were forced to read through volumes of green bar reports to get to what interested them.  SQL allowed users to select only the information they wanted to see,” says Dave. 

Traditional RDBMSs implemented indexes as a means to accelerate this process by quickly locating and combining large lists of records. Once the database arrives at the desired answer, it retrieves the data and presents it to the user. According to Dave, it is exactly in this last part of the process where the problem resides. “Since the goal of a query was usually to narrow the results, the amount of data returned was usually small, so the inefficiencies of data retrieval were hidden. For example, I just ran a test that joined a 24 million row table with a 6 million row table in Oracle. Producing a count of records from the join took about 30 seconds, since this primarily involved processing the indexes. When I added columns to the select and piped the output to NULL (to avoid the cost of writing to disk), the process took over 5 minutes.” Dave explains that, unlike user queries, a typical ETL process usually delivers millions of records to another table, and very often most of the columns are part of the output.

I guess by now you’re starting to get the picture… Who said staging data was a best practice? Today it seems more like a really expensive workaround.

This is when Dave brings up the good news. “With DMExpress, there is no need for a staging area, since it can join sources at near – and sometimes better than – indexed speeds.  This eliminates the need for terabytes of storage, and removes nightly loading, truncating and indexing from the process.”  Now, that’s clever!  Eliminate staging to reduce otherwise increasing database and maintenance costs.
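For readers wondering how two sources can be joined at all without first loading them into indexed staging tables, one textbook technique is the sort-merge join: if both inputs are already ordered by the join key, a single sequential pass over each is enough. The sketch below is only an illustration of that general idea. It is not a description of how DMExpress works internally, and the function name `merge_join` and the sample data are made up for this post. It assumes the join key is the first field of every record and that both inputs are sorted on it.

```python
from itertools import groupby
from operator import itemgetter


def merge_join(left, right):
    """Inner-join two iterables of tuples that are already sorted by their first field.

    Hypothetical helper for illustration only. Each input is read once,
    front to back, so no index or staging table is required.
    """
    key = itemgetter(0)
    left_groups = groupby(left, key=key)
    right_groups = groupby(right, key=key)
    left_item = next(left_groups, None)
    right_item = next(right_groups, None)

    while left_item is not None and right_item is not None:
        left_key, left_rows = left_item
        right_key, right_rows = right_item
        if left_key < right_key:
            # No match possible for this left key; move the left side forward.
            left_item = next(left_groups, None)
        elif left_key > right_key:
            # No match possible for this right key; move the right side forward.
            right_item = next(right_groups, None)
        else:
            # Keys match: pair every left row with every right row for this key.
            right_rows = list(right_rows)
            for l_row in left_rows:
                for r_row in right_rows:
                    yield l_row + r_row[1:]
            left_item = next(left_groups, None)
            right_item = next(right_groups, None)


# Made-up sample data, sorted by the join key (the first field).
orders = [(1, "widget"), (1, "gadget"), (3, "gizmo")]
customers = [(1, "Ann"), (2, "Bob"), (3, "Cy")]
joined = list(merge_join(orders, customers))
```

Only the rows sharing the current key are held in memory at any moment, which is why this style of join suits large sequential inputs.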

So maybe it’s time to address the second part of the question, the part that talks about record-at-a-time ETL tools. Yes, most ETL tools process one record at a time. “Luckily, we are not them,” states Mike with his usual confidence. “We use direct I/O and read data in large, sequential reads. Then we pull it into memory buffers. We thread to keep those buffers flushed. We are operating on the data in memory assets.” Now, that is fast!
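The general contrast Mike draws, large sequential reads into memory buffers versus handling one record at a time, can be sketched in a few lines. This is a rough illustration under stated assumptions, not DMExpress code: the function name `count_records` and the 1 MiB default buffer size are invented here, and it uses ordinary buffered reads rather than true direct I/O. It counts newline-delimited records by scanning whole blocks in memory.

```python
import io


def count_records(stream, buffer_size=1 << 20):
    """Count newline-delimited records in a binary stream using large block reads.

    Hypothetical helper for illustration only. The stream is read in big
    sequential chunks and each chunk is scanned in memory, instead of
    asking the I/O layer for one record at a time.
    """
    count = 0
    last_byte = b"\n"  # treat an empty stream as ending cleanly
    while True:
        block = stream.read(buffer_size)
        if not block:
            break
        count += block.count(b"\n")
        last_byte = block[-1:]
    # A final record without a trailing newline still counts.
    if last_byte != b"\n":
        count += 1
    return count


# Made-up sample: three records, the last one unterminated, read 4 bytes at a time.
sample = io.BytesIO(b"a,1\nb,2\nc,3")
total = count_records(sample, buffer_size=4)
```

In practice the same function could be handed a file opened with `open(path, "rb")`; the tiny buffer above is only there to show that records spanning several blocks are still counted correctly.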

{ 1 comment }

What is Fast?

March 7, 2011

Recently, I had the chance to join my colleagues Steve Totman and Nikhil Kumar for a great conversation with Philip Howard, research director from Bloor Research, about Syncsort’s technology and the reasons why DMExpress is so fast. We always appreciate the opportunity to speak with someone like Philip – with his impressive 30+ years of experience in the data management world – about our “secret sauce” and love it even more when it seems to leave a strong impression!

But what is fast? In our world, fast means the ability to process large data volumes in less time and with fewer resources. How fast? Well, DMExpress can process data as fast as native I/O speed, which pretty much means there’s nothing faster.

But the real question is, what does that mean to me? Because in the end, it’s all about doing more with less, right? In the end, it is about costs, about business agility, and about being able to respond faster to new demands for information.

Faster means you can offload your database by performing transformations on your ETL tool. Faster means you can spend less money on additional database capacity and more on supporting new initiatives. Faster means you get extra time to add new data sources. Faster means you spend less time fine-tuning and more time developing new reports. Faster means you can outpace the competition with information that is timely, relevant, and actionable.

So, how is DMExpress able to accomplish all of this? Well, I think Philip Howard does an excellent job explaining it. You can read his article, “How Come Syncsort is So Fast and What Does That Mean?,” on Bloor Research’s website.

{ 0 comments }

There was quite a lot of fuss last week about the Gmail outage that caused about 30,000 Gmail users to lose all their mail, contacts, etc.  30,000 sounds like a lot, and it is, but it represents only 0.02% of Gmail users.

The problem seems to have stemmed from a “storage software update.” On the data protection front, a sub-story emerged when Google said in a blog post that it was working on restoring data from tapes.

Oh mercy! The comments flew. How could Google be using tape of all things! Of course they are using tape, others replied, it’s still the best way to store data! On and on it went, and it wasn’t always polite. There was a lot of name calling, especially in the comments sections of various blog posts.

I was thinking of commenting myself, until I came across an excellent blog post by Storage Switzerland’s George Crump at InformationWeek. It is titled “What We Can Learn from the Gmail Crash,” and I think he covers the main lessons that came out of this event, and does so without taking sides or calling the other guy stupid. I recommend it for a sensible look at an event that reminds us yet again that, sooner or later, data loss issues hit everybody.

{ 0 comments }