
Interview with Mark Muncy of Syncsort: A Peek Inside the Apache Hadoop Big Data Ecosystem

In Part 1 of our interview, Mark Muncy spoke of the raw data layer, or “lake.” About Hadoop, Muncy explained that “What Hadoop is bringing to the world is a different way of growing data sizes. This raw data layer (some call it the data ‘lake’) to support putting data into information is a starting point.” In Part 2 of the interview, we discussed the future of SQL and relational database systems in the Big Data universe.

We conclude coverage of the interview in Part 3 with a discussion of corporate life inside the Apache Hadoop ecosystem.

I made a list of software firms whose business model relies on a central open source offering, among them Red Hat, Cloudera and Hortonworks. Syncsort is quite different, with a long history of successful commercial products. Looking beyond the company name, how might this history set Syncsort apart? Will customers notice the distinction?

I think there’s an important partnership that exists between commercial vendors and the open source community. Open source pushes the envelope of what’s possible. You can see what’s happening within the Hadoop community for a great example of this. The enhancements made every day are absolutely astounding. However, there’s a certain amount of comfort you get from having enterprise-grade support from a company for the technology you are implementing. Cloudera, Hortonworks and MapR are great examples of this. They are not only strong committers to the entire Hadoop ecosystem, but they provide utilities, training and support that are top notch. Syncsort has taken a page from their book. We’re among the top 10 committers to Apache Hadoop with contributions to MapReduce and most recently Sqoop, but we also provide an enterprise-grade software solution that fits seamlessly within the Hadoop ecosystem.

Since Syncsort started working with Hadoop, it has begun to operate in the fast-moving Apache ecosystem. Hadoop is the media shorthand for the ecosystem, but there’s a lot more to it. Syncsort both contributes to and sells into the ecosystem. How is that working out? Is it a successful two-way street?

It’s an amazing two-way street. I have to say we have found excellent partners in the open source community, and more specifically in the Apache Hadoop community. I think this is especially true because they tend to take a consultative approach. They help prospects, and Syncsort at the same time, to identify Big Data challenges and the best solutions to solve them, thus identifying opportunities for DMX-h. We work closely with the various distributions to see how we can best help address these challenges. Even at the engineer’s level, this kind of collaboration is very helpful. Here I’m thinking of the engineer-to-engineer discourse that plays out in the JIRAs Syncsort has contributed. This dialog helps enhance Hadoop, but it also molds our product direction and sharpens our product engineering.

Resulting in a sort of de facto collaboration around the open source suite that has Hadoop at the center?


Given this connection to the ecosystem, which some people believe is cloud-focused, does Syncsort have an easier time finding a home in cloud, on-premises, or hybrid installations?

As a company that’s been around for over 40 years, we’re accustomed to seeing the “Pay for Processing” scenario. For the mainframe, we developed technologies that reduced TCO for customers by increasing the performance and efficiency of sorting. In the cloud, a similar value proposition applies. If we can reduce the execution time and processing power required, a cloud customer’s metered instance cost is lowered. Within the “as-a-service” group of technologies, the story may be somewhat less compelling, because those offerings tend to be messaging- and transaction-oriented. Where we fit most easily is large batch data volumes. This doesn’t mean we cannot integrate as part of the overall solution, especially downstream from these services.
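The metered-cost point above is simple arithmetic, but a back-of-the-envelope sketch makes it concrete. The hourly rate and job durations below are made-up illustrative numbers, not Syncsort figures:

```python
# Hypothetical figures illustrating the "Pay for Processing" point:
# on a metered cloud instance, cutting execution time cuts the bill.
hourly_rate = 0.50          # assumed $/hour for a metered instance
baseline_hours = 4.0        # nightly batch job before optimization
optimized_hours = 1.5       # the same job after faster sort/transform

baseline_cost = hourly_rate * baseline_hours
optimized_cost = hourly_rate * optimized_hours
print(f"${baseline_cost:.2f} -> ${optimized_cost:.2f} per run")
# prints "$2.00 -> $0.75 per run"
```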

One of the interviews conducted for the Syncsort blog was with an executive at RetailNext. That conversation included some forward-looking ideas about retail Big Data, especially combining real time with batch analytics. Do you have a take on this approach as a possible Big Data design pattern?

The scenario you describe happens when historical data is merged with some segmenting and A-B testing to study whether consumer behavior can be changed. The analytics aspect incorporates the real-time analytics. In your example, real-time shopper data is used for shopper-specific route optimization, and batch analytics for decisions like aisle placement. These two processing patterns will not cannibalize each other; they will be used in tandem. Big batch analytics often help shape and derive the metrics and trigger points within the real-time data sets.
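The tandem pattern Muncy describes can be sketched in a few lines: a batch job derives trigger points from historical data, and the real-time path just compares live events against them. Everything here is a hypothetical illustration; the aisle names, dwell times, and two-sigma rule are our own assumptions, not a RetailNext or Syncsort design:

```python
from statistics import mean, stdev

# Batch side: historical per-aisle shopper dwell times (seconds),
# e.g. pulled from a warehouse. Numbers are invented for illustration.
historical = {
    "aisle-3": [42, 55, 38, 61, 47, 53],
    "aisle-7": [12, 15, 9, 14, 11, 13],
}

# Batch analytics derive the trigger points: flag dwell times more than
# two standard deviations above the historical mean for that aisle.
triggers = {
    aisle: mean(times) + 2 * stdev(times)
    for aisle, times in historical.items()
}

# Real-time side: score a single live event against the batch-derived threshold.
def is_notable(aisle: str, dwell_seconds: float) -> bool:
    return dwell_seconds > triggers.get(aisle, float("inf"))

print(is_notable("aisle-3", 75))  # unusually long dwell -> True
print(is_notable("aisle-7", 10))  # ordinary dwell -> False
```

The division of labor mirrors the interview: the expensive statistics run in batch, while the real-time path does only a cheap lookup and comparison per event.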

When looking at the big batch processing side of things, what’s traditionally been thought of as a “staging area” denotes batch processing, a well-understood design pattern. What people usually mean by staging is a staging area for in-database transformations; I tend to think of this model as a precursor to more sophisticated patterns.

Generally, we like to say that ELT is bad and ETL is good. But conventional ETL is traditionally slow in settings before Syncsort enters the picture. That’s because processing is batch, and those systems (think Oracle, SAP) are tuned for transactional performance. More critically, when you rely on costly database resources (say, Teradata), you are paying for ETL processing inside an expensive resource. With a tool like DMX in the picture, you can do ETL without weighing down the warehouse.
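The distinction above boils down to where the T happens. A minimal sketch of transform-outside-the-database ETL, using Python’s built-in sqlite3 as a stand-in for an expensive warehouse (the table and column names are invented for illustration, and this is our sketch of the general pattern, not DMX itself):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a costly warehouse
conn.execute("CREATE TABLE raw_orders (id INTEGER, customer TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [(1, " alice "), (1, " alice "), (2, "BOB")],
)

# Extract: pull raw rows out of the database.
rows = conn.execute("SELECT id, customer FROM raw_orders").fetchall()

# Transform: dedupe and normalize in application memory, not in SQL,
# so the database engine isn't burning cycles on the transformation.
clean = sorted({(order_id, name.strip().lower()) for order_id, name in rows})

# Load: bulk-insert only the finished rows back into the warehouse.
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", clean)

print(conn.execute("SELECT * FROM orders ORDER BY id").fetchall())
# prints [(1, 'alice'), (2, 'bob')]
```

In the ELT version, the dedupe and normalization would instead run as SQL inside the warehouse, which is exactly the “paying for ETL processing inside an expensive resource” scenario described above.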

Will cloud indirectly bring more attention to ETL processes then?

We think so. Current and planned product offerings are configured to take advantage of how things are shaping up in the Big Data ecosystem. There will definitely be a spotlight focused on inefficient processes that chew up processing time in expensive legacy architectures. We’re here to help customers reduce the TCO of their cloud initiatives.

Thanks for your time, Mark. Maybe we can take a deeper dive into these issues in a future chat.

Looking forward to it.
