
Interview with Syncsort’s Mark Muncy: Did Big Data Kill the SQL Star?

In Part 1 of our interview, Mark Muncy, Syncsort’s manager of Technical Product Marketing, spoke of the raw data layer, or “lake.” About Hadoop, Muncy explained that “What Hadoop is bringing to the world is a different way of growing data sizes. This raw data layer (some call it the data ‘lake’) to support putting data into information is a starting point.”

In Part 2 of the interview, we discuss the future of SQL and relational database management systems (RDBMS) in the Big Data universe.

What role might future Syncsort products play in the semantic web, and with RDF (Resource Description Framework) repositories? Perhaps with other graph oriented databases?

There are probably places where Syncsort products are being used to support apps like these. This is still a very specialized area, and most ETL tools do not handle graph and spatial data efficiently. It’s not a traditional ETL play. For now, while I think this is an interesting future direction, it remains a bit outside the current scope of where most of today’s enterprise warehouses operate.

ETL receives little attention in software engineering curricula. What consequences does this have?

One way to think of it is this: ETL is a niche that is bigger than most developers acknowledge. People write ETL code all the time but don’t call it “ETL.” We know that this happens with Hadoop implementations now.
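That everyday, uncredited ETL might look something like the following sketch: glue code that extracts delimited records, applies types and naming, and loads the result into a target structure. All names here are hypothetical and purely illustrative.

```python
# A minimal sketch of "ETL that nobody calls ETL": everyday glue code
# that extracts rows, transforms them, and loads them into a target.
# All field and variable names are hypothetical.

raw_rows = [
    "1001,acme,2500.00",
    "1002,globex,1200.50",
]

def extract(rows):
    """Extract: split raw delimited records into fields."""
    return [r.split(",") for r in rows]

def transform(records):
    """Transform: apply types and business naming conventions."""
    return [
        {"customer_id": int(cid), "name": name.title(), "amount": float(amt)}
        for cid, name, amt in records
    ]

def load(records, target):
    """Load: append the cleaned records to a target store (a list here)."""
    target.extend(records)
    return target

warehouse = load(transform(extract(raw_rows)), [])
```

Developers write pipelines like this constantly; the extract-transform-load shape is there even when nobody labels it ETL.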

While infrequently mentioned when speaking of Big Data, Oracle and Microsoft are very much involved in Big Data standards groups and have offerings of their own.

Where do you think the traditional RDBMS fits today?

The market is shifting and redefining roles. Most organizations will continue to maintain an RDBMS for many purposes. There’s the DBA skills explanation, but a bigger reason for this is that the RDBMS is where traditional transactional systems data is stored. Think of Oracle applications and SAP as the elephants in the transaction processing room – creating millions of transactions per minute, all over the world. People moving to NoSQL for transaction storage are trying to skip this and let the business layer unpack transaction entities from a lower layer.

Regarding the movement to NoSQL, I don’t have a strong opinion one way or another. But it is worth noting that, perhaps as a matter of personal preference, I tend to gravitate to an RDBMS for history – even though NoSQL has scalability going for it. I am probably not alone in this. Meanwhile, a Big Data repository can sit comfortably in the back end.

So far it’s unclear, but over time perhaps the Oracles of the world – let’s include traditional implementations of SQL Server or DB2 in that list – may have difficulties keeping pace with all that the movement to NoSQL entails.

What does this mean for the future of Syncsort products with Oracle and others?

Regardless of how the NoSQL trend plays out, Syncsort sells successfully into all sorts of SQL shops. We have no reason to displace any currently operating RDBMS technology. Here is an example of that. When I used DMX in a previous job, we had to transform data quickly to get it into our visualization suite. There was no visualization problem. Instead, DMX primarily solved a scalability issue.

Most BI stacks build first on an RDBMS, shaping the data into a star schema, and then point the visualization tool at that schema. To stay with an Oracle-centric scenario, they would use a product like Oracle Data Integrator to get data from Oracle into the warehouse, and then use PL/SQL to transform it further on a separate server. This is a typical, conventional data pipeline into a data warehouse, and it is where a lot of large enterprises are today. It is also a fairly rudimentary way of performing transformation on the warehouse. Once larger volumes, scalability issues, and the sheer cost of staging data exhaust what can be done inside the data warehouse, people modify the pipeline to do the ETL upstream, before the data warehouse.
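Pushing the transformation upstream of the warehouse can be sketched like this: instead of loading raw detail rows and transforming them afterwards in PL/SQL, the pipeline aggregates before the load so the warehouse only stages the finished result. This is a toy illustration; the table and field names are hypothetical.

```python
# A hedged sketch of doing ETL upstream of the data warehouse:
# aggregate in the pipeline so only summarized rows reach the star schema.
# All names are hypothetical.

from collections import defaultdict

detail_rows = [
    {"region": "EMEA", "amount": 100.0},
    {"region": "EMEA", "amount": 250.0},
    {"region": "APAC", "amount": 75.0},
]

def aggregate_upstream(rows):
    """Group and sum before the warehouse load step, not after it."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return [{"region": r, "total_amount": t} for r, t in sorted(totals.items())]

staged = aggregate_upstream(detail_rows)  # what gets loaded into the warehouse
```

The design point is where the cost lands: upstream aggregation trades warehouse staging cost for pipeline compute, which is exactly the tradeoff that makes horizontally scalable engines attractive for this step.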

True, Hadoop can happily sit in a pipeline to offload through horizontal scaling. That said, Syncsort DMX shines in a SQL shop as a join-and-aggregate speed demon – designed to blow past other tools either inside, or outside Hadoop. In that prior engagement, it also solved other issues by avoiding additional development.

Cloudera’s CTO Amr Awadallah writes about Hadoop’s “Schema-on-Read” vs. the RDBMS “Schema-on-Write” approach. Echoing the former defense secretary’s famous line, he writes that Schema-on-Write is good for “Known Unknowns,” whereas Hadoop’s Schema-on-Read excels with “Unknown Unknowns.” Do you agree?

It makes sense, but developers must fully understand the pros and cons. It comes down to when you want to pay the price. Yes, Schema-on-Read is more fluid, but you have to pay for the structuring sooner or later. Serialized data formats allow for greater volumes and velocity, but doing analysis (i.e., getting information) requires completing deserialization into an object you can read, and that involves additional processing, and often additional cost.
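The pay-now-versus-pay-later tradeoff can be shown in a few lines: schema-on-write imposes structure once at load time, while schema-on-read stores the raw record and re-pays the parsing cost on every read. A toy sketch with hypothetical field names:

```python
# Schema-on-write vs schema-on-read: where the structuring cost is paid.
# Toy example; the record fields are hypothetical.

import json

raw = '{"user": "alice", "clicks": 7}'

# Schema-on-write: parse once, up front, at write/load time.
structured = json.loads(raw)

def read_schema_on_write(record):
    """Reads are cheap: the record is already a structured object."""
    return record["clicks"]

def read_schema_on_read(raw_record):
    """Each read re-pays the deserialization cost on the raw bytes."""
    return json.loads(raw_record)["clicks"]
```

Both reads return the same value; the difference is that the schema-on-read path repeats the `json.loads` work per query, which is the "additional processing, and often additional cost" described above.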

One cool thing happening with Syncsort DMX is that we’re adding new capabilities along these lines: take a JSON message and unpack it into a flattened data set that is properly formatted for an ETL process. This creates a format that is faster overall than leaving it in its original unstructured form (e.g., NoSQL) and trying to deserialize it using other methods.
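The kind of flattening described (not Syncsort's actual implementation, which isn't shown here) can be sketched as recursively unpacking a nested JSON message into a flat, column-like record with dotted names, so an ETL process can treat it as a simple row. All field names below are hypothetical.

```python
# A hedged sketch of flattening a JSON message into an ETL-friendly row.
# This illustrates the general technique, not the DMX product's internals.

import json

def flatten(obj, prefix=""):
    """Recursively flatten nested dicts into dotted column names."""
    flat = {}
    for key, value in obj.items():
        name = key if not prefix else f"{prefix}.{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

message = json.loads('{"order": {"id": 42, "customer": {"name": "acme"}}, "total": 9.5}')
row = flatten(message)
# row → {"order.id": 42, "order.customer.name": "acme", "total": 9.5}
```

Once flattened, the record is deserialized exactly once; downstream joins and aggregations operate on plain columns instead of repeatedly unpacking nested structures.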

SQL is far from dead, then. What other standards (de facto or otherwise) are you watching?

HDFS, MapReduce, ANSI SQL – these standards influence products like Syncsort SILQ and other products we have on the horizon.

Hadoop itself isn’t a standards-rich environment. YARN, Spark, Pig – it’s the Wild West out in HadoopLand. We’re not yet in a phase of widespread Hadoop adoption – and perhaps there should be no standard yet.

All Give and No Take? What’s corporate life like for a company inside the Apache – Hadoop – Big Data ecosystem?

Stay tuned for Part 3 of our interview with Mark Muncy.
