Practitioners familiar with Syncsort products have a unique perspective on the health of data warehouses, and we’d like to hear yours. But first, what is your role at Syncsort, and where else have you worked?
I’ve been with the firm for around two and a half years, working variously as a solutions engineer, in presales, and as a sales engineering architect. Recently I moved into this new marketing role. For 15 years before joining Syncsort, I worked in data warehouse and database architecture – the data warehouse side of business intelligence.
How did you hook up with Syncsort?
I had moved to a firm with an acute need for ETL, as ELT was not scaling well for them. Theirs was a general business intelligence app. The company gave Syncsort a Proof of Concept assignment, and they crushed it. So I was a Syncsort customer first.
How do you feel about the Big Data meme? In some communities of practice, it is regarded as hype, or just warmed over distributed computing, or a synonym for massively parallel computing.
It’s important to tease out two separate concepts around Big Data.
1. The business challenges. Here the paradigm hasn’t shifted much; the change is mainly in the makeup of the BI stack.
2. What is changing is the technology associated with growing data volume. Previous methods for accessing storage, for example, could not keep up with the scale. Hadoop does not solve business challenges all on its own. Great, you have an exabyte of data in HDFS, but what should you do with it? How do you extract value from it to solve specific business problems?
Has the concept of business intelligence been muddied by the Big Data meme?
For some people, HDFS is the data warehouse, though I tend to think of it as complementary. No one tool can solve the world’s business intelligence challenges. When speaking of BI as a suite, or a department, you’re solving a ton of technical challenges. Hadoop solves a subset of those technical challenges. But “big box” systems like Teradata can also solve some of the same challenges.

OK, tirade alert: I am fully of the opinion that data in and of itself is worthless. You can’t make decisions based on data alone. So organizations need to gather data elements and make meaning from the data. Information comes next, and decisions can stem from that. Information is more consumable – e.g., for data scientists or reports – but it is still a limitation for decision-makers. An important goal is to turn data into metrics, not just information. I’m thinking here of today’s executive, who is primed to look at a metric and use it to support decision-making. This is where people want to go – toward decision support – whether they think of the solution space this way or not.
So where does Hadoop / HDFS fit into the warehouse?
What Hadoop brings to the world is a different way of storing, processing, and distributing growing volumes of data. This raw data layer (some call it the data “lake”) is a starting point for turning data into information. The skills of data scientists are evolving alongside the technology, which facilitates the creation of more and better metrics. Prior to Hadoop, the way to deal with large volume was to discard it. Hadoop allows for refining the accuracy of existing metrics or finding new metrics based on “hidden” data.
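The progression described here – raw data in the lake, refined into information, then distilled into a metric an executive can act on – can be sketched in a few lines. This is purely illustrative (the log format and field names are invented, not anything Syncsort- or Hadoop-specific):

```python
from collections import defaultdict

def error_rate_by_day(raw_lines):
    """Raw data -> information -> metric: per-day error rate from raw log lines.

    Each line is assumed to look like "YYYY-MM-DD,STATUS" (a made-up format).
    """
    totals = defaultdict(int)   # events seen per day (the "information" layer)
    errors = defaultdict(int)   # error events per day
    for line in raw_lines:
        day, status = line.strip().split(",")
        totals[day] += 1
        if status == "ERROR":
            errors[day] += 1
    # The metric a decision-maker actually consumes: one number per day.
    return {day: errors[day] / totals[day] for day in totals}

raw = [
    "2014-05-01,OK",
    "2014-05-01,ERROR",
    "2014-05-02,OK",
    "2014-05-02,OK",
]
print(error_rate_by_day(raw))  # {'2014-05-01': 0.5, '2014-05-02': 0.0}
```

At Hadoop scale the same aggregation would run as a distributed job over the lake, but the shape of the work – collapsing raw events into a consumable metric – is the same.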
Where does this leave the traditional BI tools suppliers, like Microstrategy, Cognos, QlikView, Tableau, etc.? Are they out of the picture unless they can “speak” Hadoop?
These guys “work up the chain” from the raw data layer, or “up the stack” if you like. They have several ways to accomplish this: through Hive or flat files, or by using the Hadoop cluster as ETL and then offloading that final dataset into the data warehouse. That last piece will be familiar to most current enterprises; the access layer is still king. I see a lot of organizations that try to save money by creating an access layer within Hadoop. Yes, I agree that newer people to the game may not understand the value of that layer; they may be trying to shoehorn it in “just because they can” rather than addressing what is needed to support key underlying business decision-making. I suspect the search for Hadoop developers is driving some of this.
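The “Hadoop cluster as ETL, then offload the final dataset” pattern can be sketched as follows. The field names and CSV extract are hypothetical; in practice the transform would run as a distributed job on the cluster and the extract would be bulk-loaded by the warehouse’s own loader:

```python
import csv
import io

def transform(record):
    """ETL transform step: keep only the fields the access layer needs."""
    return {"customer": record["customer"], "revenue": float(record["amount"])}

def build_warehouse_extract(raw_rows):
    """Aggregate transformed rows into the final dataset to offload downstream."""
    totals = {}
    for row in raw_rows:
        t = transform(row)
        totals[t["customer"]] = totals.get(t["customer"], 0.0) + t["revenue"]
    # Emit a flat-file extract; a warehouse bulk loader would consume this.
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["customer", "total_revenue"])
    for customer in sorted(totals):
        writer.writerow([customer, totals[customer]])
    return out.getvalue()
```

The point of the pattern is that the heavy lifting (parsing, filtering, aggregation) happens on cheap Hadoop storage and compute, while the warehouse only ever sees the small, curated final dataset that serves the access layer.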
Some data scientists that we have interviewed for Syncsort have worried that Big Data will expose underlying issues with data quality that were easier to ignore before Volume, Velocity and Variety scaled up. Do you agree? Is Veracity an opportunity area for Syncsort?
Absolutely. The traditional data warehouse has always concerned itself with issues of data quality and data cleansing. Will DQ drive business toward Syncsort? I would like to think so. Having good ETL processes – maintainable and well-curated – is definitely a key part of ensuring data quality for a warehouse, and Syncsort plays nicely there.
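A maintainable, well-curated ETL process typically enforces data quality with explicit validation gates before anything is loaded into the warehouse. A minimal sketch of that kind of row-level check, with invented field names:

```python
def validate_row(row, required=("id", "amount")):
    """A simple data-quality gate for an ETL step (field names are invented).

    Returns a list of problems; an empty list means the row is safe to load.
    """
    problems = []
    for field in required:
        if not row.get(field):
            problems.append(f"missing {field}")
    try:
        float(row.get("amount", ""))
    except ValueError:
        problems.append("amount not numeric")
    return problems

print(validate_row({"id": "42", "amount": "19.99"}))  # []
print(validate_row({"id": "", "amount": "N/A"}))      # ['missing id', 'amount not numeric']
```

At Big Data volumes, checks like these matter more, not less: bad rows that were once rare enough to ignore become a systematic drag on every metric built downstream.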
It’s a fairly short conceptual leap, but a very long commercial leap to move from ETL script to metadata management. How do Syncsort products interoperate with, and market to enterprise users of metadata management (MDM) tools? Has Syncsort DMX-h changed this picture at this stage of the game?
We interoperate with existing data warehouse MDMs. Speaking as an implementer or architect, though, I rarely see MDM being fully used. For agile companies that try to get stuff done, it’s difficult to keep a typical MDM solution alive. So we typically advocate doing the DMX metadata management within the tool – in other words, handling the metadata lineage and associated configuration management within DMX itself. DMX is not a replacement for an MDM, but it satisfies most of the needs, especially in rapidly deployed Big Data repositories, where DMX-h tends to find a home.
Whither SQL in the Big Data World? Stay tuned for Part 2 of our interview with Mark Muncy.