Parallel ETL Tools Are Dead

August 29, 2012

They just don’t know it yet.

The critical flaw in parallel ETL tools is the fact that the data is almost never local to the processing nodes.  This means that every time a large job is run, the data has to first be read from the source, split N ways and then delivered to the individual nodes.  Worse, if the partition key of the source doesn’t match the partition key of the target, data has to be constantly exchanged among the nodes.  In essence, parallel ETL treats the network as if it were a physical I/O subsystem.  The network, which is always the slowest part of the process, becomes the weakest link in the performance chain. 
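To make the partition-key mismatch concrete, here is a minimal sketch in Python. It is not taken from any ETL product: the modulo placement rule, the `order_id` and `customer_id` columns, and the four-node cluster are all assumptions chosen for illustration. It simply counts how many rows would have to cross the network when the target is partitioned on a different key than the source.

```python
def node_for(key, num_nodes):
    """Assign a key to a node with a simple modulo rule (an assumed placement scheme)."""
    return key % num_nodes


def rows_to_exchange(rows, source_key, target_key, num_nodes):
    """Count rows whose node under the target key differs from the node
    they already sit on under the source key."""
    moved = 0
    for row in rows:
        current = node_for(row[source_key], num_nodes)
        needed = node_for(row[target_key], num_nodes)
        if current != needed:
            moved += 1
    return moved


# Hypothetical data: orders partitioned by order_id, target partitioned by customer_id.
rows = [{"order_id": i, "customer_id": (i * 7) % 10} for i in range(1000)]

same_key = rows_to_exchange(rows, "order_id", "order_id", num_nodes=4)
different_key = rows_to_exchange(rows, "order_id", "customer_id", num_nodes=4)
print(same_key, different_key)
```

When the keys match, nothing moves. When they differ, a large share of the rows has to be shipped between nodes, and that is the traffic that turns the network into the bottleneck.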

The result is that the CPUs and memory on the local nodes are rarely fully utilized.  Basically you have a system that under-utilizes the local hardware and over-utilizes the network.  It is not surprising, then, that an efficient SMP ETL tool often outperforms bigger, more expensive parallel ETL tools.  But, given the investment companies have made in these tools, it’s difficult to justify a rip-and-replace strategy.

With the arrival of Hadoop, all of this has changed.  Hadoop provides low-cost storage as well as the potential for scalable ETL via the Map/Reduce paradigm.  For the first time in a parallel environment, the processing is scheduled on the nodes where the data already resides, a huge performance advantage.  Now, ETL designers can take advantage of the scale of Hadoop without having to pay the penalty for excess network traffic.  Because of this, as Hadoop matures, it could become the ETL platform of choice for large organizations.
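As a rough illustration of the Map/Reduce paradigm, here is a pure-Python sketch. It does not use the Hadoop API: the `blocks` list stands in for data blocks held on separate nodes, and the function names and sample records are hypothetical. Each block is mapped and summed where it lives, so only the small per-block totals need to be merged.

```python
from collections import defaultdict


def map_phase(local_records):
    """Runs against one local block: emit (key, value) pairs."""
    for customer_id, amount in local_records:
        yield customer_id, amount


def reduce_phase(pairs):
    """Sum the values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Hypothetical blocks, each imagined to sit on a different node.
blocks = [
    [("a", 10), ("b", 5)],
    [("a", 3), ("c", 7)],
]

# Aggregate each block locally first, then merge only the partial totals.
partials = [reduce_phase(map_phase(block)) for block in blocks]
merged = reduce_phase(pair for partial in partials for pair in partial.items())
print(merged)
```

The design point is that the bulky raw records never leave their block. Only the compact partial results are combined.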

Remember those ETL tools that were designed to be efficient in an SMP environment? They’re back! Now that the data is local, the ability of the tool to fully utilize hardware resources becomes even more important.  A tight, efficient engine provides Hadoop with the ability to scale both horizontally and vertically – more work with less!

These simpler tools will also help with the adoption of Hadoop as an ETL platform. Currently, there is a huge disconnect between the ETL designer and the Java programmer.  Most ETL designers don’t know Java and most Java developers don’t know data structures, so even if the processing is efficient, the coding isn’t.  Organizations will need twice the people to solve half the problem.  However, as these SMP ETL tools fully integrate with Hadoop, the visual design paradigm will be inherited by Hadoop, making development much simpler and more data driven.  This means that Hadoop will provide the ideal combination of performance, scalability, and ease of use.  At that point, why would customers pay for a heavyweight, complex, parallel ETL tool?  I’m betting they won’t.


Another August, another VMworld! As always, it’s a hectic, crazy time of non-stop activity and the San Francisco weather could not be more spectacular.

Once again this year, Syncsort is running the Race to Recovery in our booth (#501 if you’re at the show).  If you’re not familiar with it, in the Race to Recovery we invite two audience members up on stage to run a live VM recovery. No canned demos, no tricks:  the software is running live on ESX servers and NetApp storage.  So far this week, the fastest recovery time has been one minute and six seconds!  That’s a pretty good time to restore a VM. If you’d like to participate, stop by the booth to check out the next scheduled demonstration. The fastest recovery time each day wins an iPad and we’ve got two more days to go.

Aside from the Race to Recovery, we’ve been talking with lots of users and one theme has really come out this year: lots of backup products don’t work. I’m not going to name any names, but I’ve heard a number of stories this week about well-known backup products that just aren’t doing what the vendors claimed they could do. Products that worked in the lab with 30 VMs but died in production with 300 VMs. Products that take forever to get a backup done and that take several days to complete a Bare Metal Restore (that product was replaced with NetApp Syncsort Integrated Backup (NSB), and a BMR now takes just an hour or so).  A product that has been installed for six months and isn’t working yet!

Some of the users are still struggling along with their products, others were telling stories from the past before they switched to NSB. The most amazing single statistic was a customer that used to need eight hours to back up their Exchange server and is now backing it up in five minutes! That’s right, five minutes!  You can’t even quantify that as a backup “improvement.” It’s a backup revolution.

Across all these conversations what struck me was something that I don’t always appreciate about NSB, which is that it works. You get used to that and you forget that not everything else works as advertised. But NSB’s simplified architecture, exceptional reliability and deep integration with NetApp storage combine to provide a solution that does what you need it to do. Back up fast with little impact, restore in minutes, and integrate with replication and disaster recovery for a complete solution.

Drop by booth #501 to learn more.


In case you haven’t heard, Peter Eicher, one of the smartest data protection experts Syncsort has ever seen, became a free agent this summer. More specifically, he became a blogging free agent.

With 15+ successful seasons in the computer software industry under his belt, Peter had suitors coming at him from nearly every direction. In fact, there might have been even greater anticipation for Peter’s decision on where to blog than a couple of summers ago when LeBron James held “The Decision” TV special to announce he’d play basketball in Miami.

Never one to draw unnecessary attention to himself, Peter simply issued a statement:

“I had a tough decision to make this summer. After thinking long and hard about it and consulting with my colleagues at Syncsort, I have decided to take my blogging talents to Computerworld and join their guest blog team.”

We’re excited about Peter’s new blogging venture. Have no fear, you can continue to read his industry observations here on the Syncsort blog as well. Peter has promised to not forget where his blogging career all began!

For now, check out Peter’s first official post for his Computerworld blog, Data Protection Insights, on confronting data loss events and how sometimes there are events that go beyond our more manageable and typical “catastrophes.”

For those of you at VMworld next week, don’t be shy about stopping by and meeting Peter at the Syncsort booth (#501).


Walter Curti Joins Syncsort

August 23, 2012

I’m delighted to report that Walter Curti has joined Syncsort as Vice President of Data Protection Engineering.  Walter is a long-time player in the data protection space and a true visionary.

I first met Walter back at Cheyenne Software, where I worked in the mid-90s.  He was in charge of the Windows division at Cheyenne, where we worked on ground-breaking products like ARCserve data backup software and InocuLAN anti-virus software. Those were the early days of what we used to call client-server computing, and the innovations were coming fast and furious. Cheyenne was breaking new ground in protecting open files, in integrating backup and anti-virus with applications, and with unique ways to make tape backup faster (including “Tape RAID” – whatever happened to that?). This was back in the day when a backup that measured in gigabytes was a lot! These innovations seem routine today, but back then integrating backup with a database was cutting edge.

Another event I remember very well was when Cheyenne developed the first Bare Metal Restore application for Windows NT. That project was driven by Walter, and I recall him telling me how Microsoft said it was an impossible task. There was no way to restore a server without going through the full Windows NT install and then copying back the data. You couldn’t restore the previous system state. Well, Walter didn’t let Microsoft put him off, and sure enough he figured out how to do it and Cheyenne was the first vendor with a BMR product for NT.

This refusal to accept defeat is characteristic of Walter’s approach to technology and innovation. He is also a terrific team leader and motivator who knows how to listen to his people. You can’t say that of everyone in this business!

Walter and I took different roads after CA acquired Cheyenne, and I didn’t work with him again until our paths crossed briefly several years ago. I’m very excited to be working with him again now at Syncsort, and I’m confident that our users and partners will soon see the benefits his vision, enthusiasm and insight will bring to our data protection offerings.
