Diving in the Big Data Lake? How to Keep Your Head Above the Waterline
I’m excited to share an announcement today: a partnership between Syncsort and Waterline Data Science that opens a whole new way of pulling value from your data.
Imagine this scenario: your organization has chosen to offload some of its customer and product data from the EDW onto your Hadoop cluster, while another group has been loading website logs and Twitter streams into the same cluster. How do you bring this data together for analytics?
Waterline can not only add schema and structure to that data, but also discover correlations between it and your customer and product data. That discovery yields new data inventory, complete with structure. Syncsort can then join those data sets in ways that were never anticipated, and aggregate the output into analytics-friendly formats for tools like Tableau.
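Both products drive this workflow through a GUI rather than code, but the underlying join-then-aggregate pattern can be sketched in plain Python. Everything below — the field names, the sample records, the idea that discovery surfaced a `customer_id` column in the logs — is purely illustrative, not part of either product’s interface.

```python
from collections import defaultdict

# Curated customer records offloaded from the EDW (illustrative sample).
customers = [
    {"customer_id": "C1", "segment": "retail"},
    {"customer_id": "C2", "segment": "wholesale"},
]

# Web-log events whose customer_id field was surfaced by automated
# discovery (illustrative sample; in practice these start as raw logs).
log_events = [
    {"customer_id": "C1", "page": "/pricing"},
    {"customer_id": "C1", "page": "/checkout"},
    {"customer_id": "C2", "page": "/pricing"},
]

# Join the two data sets on the discovered key, then aggregate page
# views per customer segment -- the kind of analytics-friendly summary
# a tool like Tableau can consume directly.
segment_by_id = {c["customer_id"]: c["segment"] for c in customers}
views_per_segment = defaultdict(int)
for event in log_events:
    segment = segment_by_id.get(event["customer_id"])
    if segment is not None:
        views_per_segment[segment] += 1

print(dict(views_per_segment))  # → {'retail': 2, 'wholesale': 1}
```

At scale this same join would run across the cluster rather than in memory, but the shape of the work — a discovered key linking unknown data to known data, then a summary rollup — is the same.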
Syncsort has been helping many enterprise data architects move and process data to reduce the load on their traditional Enterprise Data Warehouse (EDW) and mainframe environments. Syncsort DMX, an ETL tool, enables fast, efficient data transfer, along with enrichment and cleansing on the way to Hadoop. Syncsort DMX-h then lets complex processing flows be built on Hadoop through a simple GUI, with no Java code generation and no inefficient SQL-to-MapReduce engines.
Much of the data flowing into Hadoop is unstructured and will never be housed in the traditional EDW. Waterline Data Inventory gives you the chance to learn about the data you don’t know well: it can find relationships between unexplored social media data and the well-understood enterprise data that originates in the EDW. The so-called “data lakes” being created today hold a wild mix of known data from the EDW and the mainframe, and unknown data from logs and websites.
We think this kind of automated discovery, assisted by domain expertise within the enterprise, will unlock a new realm of value from distributed processing frameworks like Hadoop.