January 2012

Each January for the last 8 years, I have had the opportunity to escape the cold winter weather of the Northeast to visit warmer places like Las Vegas and Miami (where I currently am!) to catch up on the latest BI trends at MicroStrategy World. (Full disclosure: I’m an ex-MicroStrategist.)

With more than 2,000 attendees, this year’s conference has been one of the most exciting, energizing events I’ve attended in a long time. IT professionals from all around the world are thrilled about the seemingly endless possibilities that market-disrupting forces like Big Data, cloud, social media and mobile technologies are producing. When all is said and done, these things will have made a profound impact on the way we do business as well as the way we live, communicate and interact with our world. In fact, they already have!

During my many conversations with BI professionals here at MicroStrategy World and elsewhere, they often cite a fundamental challenge that prevents them from fully realizing the benefits of their BI applications. That challenge is how to build a strong data integration (DI) infrastructure that enables IT to capitalize on the opportunities of Big Data. This is exactly where BI meets DI.

It is clear to me that a solid data integration infrastructure not only accelerates BI initiatives, but also helps maximize their benefits by making more accurate and relevant data available in much less time. However, this is a story that is much more powerful when told by one of our customers. For those of you in Miami at MicroStrategy World, don’t miss the chance on Thursday, Jan. 26 at 11:30 a.m. to hear directly from a leader in the healthcare industry. Part of the Big Data track, the presentation will be held at the Intercontinental Miami and will focus on using Syncsort’s DMExpress to reduce the cost and complexity of ETL for better, faster BI.

For the lucky ones in Miami, see you there! For everyone else, please feel free to leave a comment. I’m interested in your thoughts and experiences.

{ 0 comments }

The Proof is in the Pudding

January 24, 2012

The majority of technology sales, particularly in software, require some sort of proof of concept (POC) intended to prove out the product(s) based on a customer’s requirements. Syncsort is no stranger to POCs and we have a record of producing some really impressive results. Recently, we had the opportunity to present some of these results to a respected industry analyst. He suggested we share some of our POC results on a regular basis on the Syncsort blog. What a great idea!

In writing the first in what will be a series of posts throughout the year on POC results, I was inspired by my colleague and fellow Syncsort blogger Dave Nahmias. One of the phrases that those of us who work with Dave have no doubt heard him say at one time or another is, “the proof is in the pudding.” Time and time again, we have seen prospects pleasantly surprised (and even amazed!) when they get their hands on DMExpress and take an up-close look at just how fast, efficient and simple it really is to use.

Let’s start with a relatively straightforward POC performed on a Windows machine with 8 cores. Clearly this is not a large, powerful box. This will be important to keep in mind as I share the results. This particular job joined two data sources, performed two aggregations, and then loaded the data into SQL Server and Oracle and also wrote it to a compressed file.

Here are some of the specifics:

  • One of the data sources consisted of more than 100 million records (15GB of compressed data). The second data source was small (1,200 records). The reading of both files and the join took 2 minutes 25 seconds (about the same as the amount of CPU time). Only 35MB of memory was used!
  • The first aggregation took just under 19 seconds, and used only 3 of the cores and 50 seconds of CPU time.  This included the write to the compressed file and the load into Oracle.
  • The second aggregation took 40 seconds, using only 2 cores and 45 seconds of CPU!  This included the load into SQL Server.

Total job time: 3 minutes, 5 seconds!

So, what were we trying to beat? How about almost 4 hours of processing that was running in the database! Not only did we beat that time by well over an order of magnitude, the customer can now use a graphical interface to build and maintain the ETL. Perhaps more importantly, the customer can also offload expensive database cycles and staging tables.
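For readers who want to check the arithmetic, here is a minimal sketch in Python that works only from the figures quoted above. The function names are made up for illustration, and the baseline is rounded up to a full 4 hours even though the post says "almost 4 hours", so the speedup it reports is an upper estimate.

```python
# Figures quoted in the post. The baseline is rounded up to a full 4 hours,
# so the resulting speedup is an upper estimate.
BASELINE_SECONDS = 4 * 60 * 60      # "almost 4 hours" running in the database
POC_TOTAL_SECONDS = 3 * 60 + 5      # reported total job time: 3 minutes, 5 seconds


def speedup(baseline_seconds, new_seconds):
    """How many times faster the new run is than the baseline."""
    return baseline_seconds / new_seconds


def average_busy_cores(cpu_seconds, elapsed_seconds):
    """Average number of cores kept busy: CPU time divided by wall-clock time."""
    return cpu_seconds / elapsed_seconds


if __name__ == "__main__":
    print(f"Speedup: about {speedup(BASELINE_SECONDS, POC_TOTAL_SECONDS):.0f}x")
    # First aggregation: 50 s of CPU in just under 19 s elapsed (19 s used here).
    print(f"First aggregation: {average_busy_cores(50, 19):.1f} cores busy on average")
    # Second aggregation: 45 s of CPU in 40 s elapsed.
    print(f"Second aggregation: {average_busy_cores(45, 40):.1f} cores busy on average")
```

Dividing CPU seconds by elapsed seconds is a simple way to see how much of an 8-core box a step really occupied, and both aggregations come out well below the 8 cores available.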

Since we do this for a living and see results like this from DMExpress all the time, it is easy to lose sight of the impressive results consistently coming from POCs. However, what we believe makes these results even more impactful is that they were achieved without consuming the entire box.

Stay tuned for more results in the days, weeks and months ahead. We’d also love to hear from anyone interested in learning more or who has seen similar results with their tools (please feel free to post a comment). We are also willing to take on those interested in challenging us to a benchmark!

{ 0 comments }

It’s been a while since Part 2 of our “Data Protection Survey Series.”  I’ve been very busy preparing for Syncsort’s sales kickoff coming up in a couple of weeks and also doing some early prep for NetApp Insight in Macau, China in February (hope to see lots of NetApp partners there!).  Kickoff should be a great affair with lots of interesting guest speakers, and I’ll be sure to provide some reports from the event. Meanwhile, back to our survey!

Our final topic is recovery, and the short version of our survey results is that data recovery is truly at risk for many organizations. Systems are not being protected as they should be, and confidence levels are not high.

We started by asking what percent of servers were being backed up each night (broken out into physical and virtual). For physical servers, only 29 percent of respondents were backing up 100 percent of their servers. This means that 71 percent had some amount of exposure to unprotected data. 

On the virtual side, results were both better and worse.

A slightly higher percentage of users (31 percent) were backing up 100 percent of their virtual machines (VMs), but there were more users protecting less than 50 percent of their VMs. 

The first problem around recovery is that a lot of data (roughly 30 percent) is not even being backed up on any given night. However, the question specifically asked respondents to only consider their backup schedules. In other words, what percentage of your servers are you even trying to back up?  It didn’t take into account backup success rates, so that was our next question.  What percentage of your backups complete successfully each night?

Only 18 percent of users are seeing 100 percent nightly success rates. The bulk of respondents (57 percent) were getting what is typically considered a reasonable success rate of between 91 and 99 percent. However, a full 25 percent of respondents were at 90 percent or less success, adding a significant amount of data exposure to their organizations each night.

With all these issues around backup, we wanted to see how confident users were about data protection. The answer: not very.  We wanted to know how people would view a major disaster where an entire data center was lost, so we asked:  “In the event that you lost an entire data center, how confident are you that you could restore application services in a timely manner?”  Here are the responses.

Only 14 percent considered themselves “totally confident” with another 33 percent “very confident” (defined as: “I expect most recoveries will succeed but I am not convinced I can achieve 100 percent recovery of all systems”).

More than 50 percent of users had a significant degree of uncertainty. In fact, the results are worse than shown here because 14 percent of total respondents said they didn’t have disaster recovery in place at all! Their responses were excluded from the chart. So, well over half of our survey participants are effectively risking their businesses in the event of a major disaster.

Our final question was around disaster recovery (DR) testing.  DR testing is usually a rather difficult affair, often involving long hours on weekends spent trying to bring up systems.  But testing is critical: it’s the only way to know you can actually restore your data  when you need to.

Again, we see a lot of potential exposure. A little over half of respondents test DR at least once a year. The rest range from less than once a year to never, or they have no DR to test. When we correlated “confidence” with “testing,” it was not a big surprise to find that among the group that was “totally confident” they could restore data, 60 percent said they test more than once a year and 29 percent tested once a year. In other words, 89 percent of the “totally confident” users test their DR once a year or more.

On the flip side, of those that were “reasonably confident” they could restore data, only 9 percent tested more than once a year and 32 percent tested once a year.   You couldn’t ask for a clearer indication that “testing equals confidence,” and that’s why one of the things I like to emphasize about NetApp Syncsort Integrated Backup is that it makes DR and testing your DR so easy.  It’s no wonder that more than 90 percent of NSB customers deploy at least two NetApp FAS units, one for local backup and recovery and the other for remote-site disaster recovery.

Data protection and recovery are serious concerns and can’t be taken lightly. I certainly don’t think that all the risk exposure our survey uncovered is because users are indifferent to the problem.  What they are is overwhelmed. Too much data plus disruptive new technologies like virtualization have made conventional backup models obsolete. 

As our survey showed, this has led to a mix of problems: too many products being used, backups taking too long, and recovery at risk. Backup needs to be modernized and it can’t happen too soon!

{ 0 comments }

According to a saying often attributed to Albert Einstein, the definition of insanity is repeating the same actions and expecting a different result. While I won’t go quite as far as to call it insanity, it has always bothered me that people keep tuning ETL tools that can’t handle larger data volumes. Over Christmas I had an experience which helped me understand at least some of the logic behind it.

It was Christmas Day and I was staying at the home of my fiancé’s parents. I had taken an inflatable bed so that we could stay the night after indulging in way too much turkey. Having managed to shoehorn the bed into a room that was entirely too small for it, I settled down to sleep. Shortly thereafter at about 3 a.m., I woke to find that I was being swallowed by the mattress. It had developed a slow puncture. For those of you that haven’t experienced it, moving around on a deflating air mattress is not easy or fun!

Knowing that if I got up and off the mattress it was going to deposit my fiancé onto the floor, I had little choice but to inflate the mattress from where I was (waking up everyone else in the house in the process). From that point forward, I spent nearly every hour repeating the same process of inflating the mattress until it was time to get up for the day. Needless to say, I was grumpy and the rest of the house was irritable that entire day. There was also a large air mattress deposited directly into the rubbish bin!

This whole situation got me thinking. Even though I knew it wouldn’t help for more than an hour, why did I continue to inflate the mattress throughout the night?

For starters, I didn’t think that I had any other options (although the 4 hours I spent sleeping on the sofa the next day while Boxing Day chaos continued around me proved that wrong). I also thought (at least for the first inflation at 3 a.m.) that inflating the mattress would permanently solve the problem. It was after the second time (okay, probably the third) that I got wise.

Bringing my crazy story back to ETL, the vast majority of people out there “tuning” ETL tools are likely working on this same logic. The first time they do it, they probably assume they will only need to do it once. The second time, they may think that they just didn’t get it quite right last time and that this time it will work. The third time, the harsh reality of their situation starts slowly seeping in as they realise they could be doing this for the rest of eternity and not get the result they are seeking.

However, here is the thing. Ultimately, I knew I only had to keep inflating the bed that one night. The next day the leaky air mattress would be in the bin and I’d be at home sleeping in my own bed. People who “tune” ETL tools don’t have that luxury. They know data volumes are increasing (between 10% and 500% a year depending on which customer I talk to) and fundamentally their ETL tools aren’t going to help. Sure, they can try and buy more hardware (a bigger air mattress), but that’s just a temporary (and very expensive) measure because that leak is definitely going to reappear.
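To put rough numbers on the “bigger air mattress,” here is a back-of-the-envelope sketch in Python. It rests on two assumptions that are mine rather than measured: that a hardware upgrade multiplies the data volume a tool can handle by a fixed factor (2x in the example), and that data grows at a steady compound rate each year. The function name is hypothetical.

```python
import math


def years_of_headroom(capacity_multiplier, annual_growth):
    """Years until data growing at `annual_growth` (0.10 means 10% a year)
    reaches `capacity_multiplier` times today's volume.

    Assumes steady compound growth, which is a simplification.
    """
    if capacity_multiplier <= 1 or annual_growth <= 0:
        raise ValueError("need capacity_multiplier > 1 and annual_growth > 0")
    return math.log(capacity_multiplier) / math.log(1 + annual_growth)


if __name__ == "__main__":
    # The two ends of the growth range mentioned above: 10% and 500% a year.
    for growth in (0.10, 5.00):
        years = years_of_headroom(2, growth)
        print(f"{growth:.0%} annual growth: doubled capacity lasts about {years:.2f} years")
```

Under these assumptions, doubling capacity buys a little over seven years at 10 percent annual growth, but less than five months at 500 percent.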

In fact, given all the discussion about Hadoop and Big Data, I am now picturing an elephant standing on a deflating mattress! For those of you that made it to this point in my post, thank you for sticking with me. Now it is your turn. I’d love to hear about your thoughts and experiences tuning ETL tools to handle larger data volumes. Comments are welcome!

{ 0 comments }