The only constant in Big Data is change, or is it?
Attending Strata + Hadoop World 2015 in New York City is always an amazing experience because there is so much you can do and learn. Sessions, tutorials, networking and vendor visits are eye-opening every time you attend one of these shows. Seeing how fast the big data ecosystem changes and evolves in such a short amount of time is incredible, but it can sometimes be overwhelming. In some ways, Strata + Hadoop World is a microcosm of the Hadoop ecosystem. It’s exciting. It’s promising. But once you try to digest everything thoroughly, it’s exhausting. This time out, I took a step back and evaluated how I could best use my limited time at such an extensive event. I decided that since the technology evolves so quickly, I wouldn’t focus on the technology itself; instead, I wanted to see how people actually used it.
Strata + Hadoop World is where cutting-edge science and new business fundamentals intersect—and merge.
With this in mind, I reached out to as many attendees, speakers and vendors as I could to learn what goes on in their big data ecosystems. What I found, to be honest, was kind of shocking. For all the promise of “just load all your data into HDFS and your business will instantly gain insight and make money hand-over-fist,” no one I could find actually did this. Instead, the process was virtually the same as it was before Hadoop was even born. The only thing that changed was the underlying technology, which now had to accommodate the growing volume, variety and velocity of the data as well as the growing demands of the business.
Back in the day, you had maybe 3 to 5 data sources that you wanted to pull into your data warehouse so you could run reports against that data. Having so few data sources made things like data wrangling, quality, lineage, and analysis trivial to do, even by hand (a very primitive form of technology). Maybe you’d convert the data from the write-optimized format of your transactional systems to a read-optimized format like a star or snowflake schema, making reporting much easier and faster. You could even do this on your data warehouse in staging tables. You could then build OLAP (Online Analytical Processing) cubes for easier slicing and dicing of the data for analysis and insights.
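That write-optimized-to-read-optimized conversion can be sketched in a few lines. This is a toy illustration only, with hypothetical table and column names, not any particular warehouse’s schema: normalized transactional rows are joined against dimension tables to produce wide, star-schema-style fact rows that reports can scan without further joins.

```python
# Toy, write-optimized (normalized) rows, as a transactional system might store them.
# All names here ("orders", "dim_product", etc.) are hypothetical.
orders = [
    {"order_id": 1, "product_id": 10, "customer_id": 100, "qty": 2},
    {"order_id": 2, "product_id": 11, "customer_id": 100, "qty": 1},
]
dim_product = {
    10: {"name": "Widget", "category": "Tools"},
    11: {"name": "Gadget", "category": "Toys"},
}
dim_customer = {100: {"name": "Acme Corp", "region": "East"}}

def build_fact_table(orders, dim_product, dim_customer):
    """Join dimension attributes onto each order to produce
    read-optimized, report-ready fact rows (star-schema style)."""
    fact = []
    for o in orders:
        p = dim_product[o["product_id"]]
        c = dim_customer[o["customer_id"]]
        fact.append({
            "order_id": o["order_id"],
            "qty": o["qty"],
            "product_name": p["name"],
            "product_category": p["category"],
            "customer_region": c["region"],
        })
    return fact

fact_rows = build_fact_table(orders, dim_product, dim_customer)
```

In a real warehouse this join would be SQL over staging tables rather than Python, but the shape of the work is the same: pay the join cost once at load time so every report afterward is a simple scan.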
Demands (real-time, self-service, etc.) and data growth (more sources, larger volumes) are increasing exponentially, while technology is changing just as fast to keep pace. New tools emerge seemingly daily to handle one specific function of the data pipeline, at scale. These tools are fantastic at their focus, but their messaging tends to downplay the necessity of the rest of the data pipeline (e.g., analytics tools promising you no longer need ETL). In my conversations with attendees, these promises fall short, leaving people confused, frustrated, and unsure where to start.
My advice for them is simple: don’t change your process, change your technology.
When you take a technology-first approach, it’s like having a hammer: everything starts to look like a nail. Find the nail first. Take, for instance, batch SQL queries on your data warehouse or mainframe. Studies show that a majority of data warehouses are performance or capacity constrained, so offloading these workloads to new technology would be a perfect nail. Having a familiar, well-defined use case like this also lets you focus on the new technology, rather than on the intricacies of defining a brand-new use case.
With tools like Syncsort DMX-h, you can make the transition to Hadoop even easier. If your SQL ELT jobs were written a long time ago, by someone no longer at the company, or are just really long and messy, SILQ can help visualize and offload them to Hadoop with the click of a button. At Strata, I noticed every single session with “Spark” in the title was packed to the brim, because many attendees were moving workloads within Hadoop from MapReduce to Spark. Two years ago, I remember the same phenomenon with the move from MapReduce V1 to MapReduce V2. DMX-h was able to help its customers make that transition seamlessly with Intelligent Execution: no matter where you end up deploying the job (on-site, cloud, Hadoop), the design of the job stays the same. This helps developers keep pace with quickly changing big data technologies without having to learn an entirely new stack every 12-18 months.
To keep pace with the growing demands of the business for real-time data, Syncsort DMX-h provides support for Apache Kafka. This allows developers to blend both real-time and batch data (Lambda Architecture) with the same graphical UI.
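The blending the Lambda Architecture describes can be sketched conceptually. This is a minimal illustration, not DMX-h’s or Kafka’s actual API: a hypothetical serving layer merges a precomputed batch view of historical data with increments from recent streamed events (the speed layer), so queries see both at once.

```python
from collections import Counter

# Hypothetical batch view: counts precomputed over historical data
# by the batch layer (e.g. a nightly Hadoop job).
batch_view = Counter({"page_a": 1000, "page_b": 750})

# Hypothetical speed layer: events that streamed in (e.g. via Kafka)
# since the last batch run.
realtime_events = ["page_a", "page_b", "page_a"]

def merged_view(batch_view, realtime_events):
    """Serving layer: blend the batch view with real-time increments
    so queries reflect both historical and in-flight data."""
    speed_view = Counter(realtime_events)
    return batch_view + speed_view

view = merged_view(batch_view, realtime_events)
# view["page_a"] == 1002, view["page_b"] == 751
```

The key design point is that neither layer replaces the other: the batch layer periodically recomputes an authoritative view, and the speed layer only has to cover the gap since the last recompute.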
In fact, Wikibon just published a research report, “Simplifying and Future-Proofing Hadoop,” that addresses this dilemma and how you can get started (and stay ahead) with tools, like Syncsort DMX-h, that hide much of the complexity.