Has your data science team taken a dip in a data lake lately?
Unlocking knowledge from raw data is a recognized goal for most organizations, but more than one path leads to actionable insight. For well over a decade, it has been clear that insight can be gained from a traditional data warehouse using business intelligence (BI) tools and dashboards. In other words, Big Data may dominate the news, but insights are still to be found through disciplined use of mature BI tools.
“Data lakes are not appropriate for all corporate storage,” says Datamation’s Christine Taylor.
Then what’s the buzz about data lakes?
Dip Into Data Lakes
Important distinctions can be made between a data warehouse and a data lake.
The data in a data warehouse is structured, sometimes with extensive metadata or pre-aggregated data designed to facilitate reporting of key performance indicators. A data warehouse can be organized to support predetermined requirements, such as budgeting or compliance. While a data warehouse does not typically reach petabyte levels, data warehouses can be quite large. Because warehouse data must be structured, specialists refer to deposits to a data warehouse as “schema on write,” since the data model is known in advance.
A data lake, on the other hand, may combine structured and unstructured data. As KDnuggets puts it, in a data lake “the data structure and requirements are not defined until the data is needed.” As a result, specialists refer to deposits to a data lake as “schema on read.” Uses for data lakes are still being discovered, and the data lake remains early in its adoption cycle, with most success stories coming from data scientists rather than from the broader community of data warehouse users. As Tamara Dull at SAS explained in the KDnuggets report, “A data lake is not a data warehouse. They are both optimized for different purposes, and the goal is to use each one for what they were designed to do.”
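The difference between “schema on write” and “schema on read” can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory SQLite table stands in for a warehouse, a list of raw JSON strings stands in for a lake, and the `sales` table, the field names and the `read_sales` helper are all hypothetical.

```python
import json
import sqlite3

# Schema on write: the table layout is fixed before any data arrives,
# so every deposit must already match the model.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT NOT NULL, amount REAL NOT NULL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("east", 120.0), ("west", 80.0), ("east", 30.0)],
)
warehouse_totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)

# Schema on read: raw records are stored as-is, whatever their shape.
raw_lake = [
    '{"region": "east", "amount": 120.0, "channel": "web"}',
    '{"region": "west", "amount": "80"}',
    '{"note": "sensor offline"}',
]


def read_sales(lines):
    """Impose a sales structure at read time; skip records that do not fit."""
    for line in lines:
        record = json.loads(line)
        if "region" in record and "amount" in record:
            yield record["region"], float(record["amount"])


lake_totals = {}
for region, amount in read_sales(raw_lake):
    lake_totals[region] = lake_totals.get(region, 0.0) + amount

print(warehouse_totals)
print(lake_totals)
```

In the first half, the structure is enforced when data is deposited. In the second, the same question is answered by deciding what counts as a sales record only at the moment the data is read.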
“What data lakes do well – very well – is to improve business analytics and BI by integrating results from multiple knowledge sources,” Christine Taylor adds.
Filling Lakes on z/OS
Not surprisingly, IBM sees z/OS as a platform well suited to hosting data lakes for a multitude of emerging enterprise needs. Most recently, IBM has advocated real-time analytics in z/OS from “multiple sets of data,” especially using a version of Apache Spark specifically designed for z/OS. The IBM offering includes the Spark core plus Spark SQL, Spark Streaming, the Spark Machine Learning Library and GraphX. According to IBM, there is a family of use cases in which z/OS IMS, VSAM, DB2 or SMF data can be accessed through Spark SQL.
Splunk is often mentioned as a useful tool for creating and exploiting data lakes. According to a Datamation survey, Splunk can extract useful interpretations from sources as diverse as machine logs, social media data, sensor streams, application transactions and website data.
Machine Learning in Splunk
Splunk is increasingly seen as one of the go-to tools for data collected from the Internet of Things (IoT). IoT data can be readily collected through an array of z/OS and mainframe hardware features without incurring the latency and cost of an offsite cloud. Tony Cosentino at Ventana Research described Splunk’s current sweet spot as “. . . dealing directly with distributed, time-series data and processes on a large scale.”
One of the emerging benefits of a Splunk data lake is the ability to use machine learning. Splunk has released its Machine Learning Toolkit application (visit GitHub for details). Splunk’s IT Service Intelligence (ITSI) product was constructed using the Machine Learning Toolkit, and the approach can be generalized to a wide variety of other pattern recognition tasks.
Suggested uses for Splunk Machine Learning include predicting customer churn, detecting insider threat, managing capacity and recognizing maintenance issues.
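To give a feel for the kind of pattern recognition involved in capacity and maintenance monitoring, here is a minimal sketch of outlier detection on a time series. It is not Splunk’s Machine Learning Toolkit or its API; it is a plain-Python trailing-window z-score check, and the function name `flag_outliers`, its parameters and the sample CPU readings are all assumptions made for illustration.

```python
from statistics import mean, stdev


def flag_outliers(values, window=5, threshold=3.0):
    """Return indexes whose value lies more than `threshold` standard
    deviations from the mean of the preceding `window` readings."""
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        # A flat history has no spread to compare against, so skip it.
        if sigma > 0 and abs(values[i] - mu) > threshold * sigma:
            flagged.append(i)
    return flagged


# Hypothetical CPU utilisation samples (percent), one per interval.
cpu_percent = [41, 43, 42, 40, 44, 42, 43, 95, 42, 41]
print(flag_outliers(cpu_percent))
```

A production tool would use far richer models, but the idea is the same: learn what normal looks like from recent history and flag readings that depart from it.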
Recently, Medical Mutual of Ohio announced that it was using Splunk Enterprise to help meet its computer security objectives. Splunk enables Medical Mutual to fuse data from multiple sources collected through Syncsort Ironstream and helps identify potential breaches or other security risks.
Watch the video Enterprise Security with Ironstream + Splunk ES for more information on how organizations can now see, analyze and correlate all their critical distributed and mainframe-based security data, including SMF records.