Hadoop Co-Creator on the Future of Big Data
When one of Hadoop’s original developers tells other developers to pay attention to Google’s prognostications, it is advice IT executives should ignore at their peril.
“Google is living a few years in the future and is sending the rest of us messages,” Doug Cutting told the O’Reilly Strata Conference in London this November.
One example is Google Spanner, a distributed database that has attracted attention for its use of an API Google calls TrueTime. TrueTime exposes clock uncertainty explicitly, which lets Google data centers across the world stay in sync with each other while avoiding excessive latencies.
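The idea behind TrueTime can be illustrated with a toy model. The sketch below is not Google's code: the class and function names, the simulated clock and the 4 ms uncertainty are all assumptions made for illustration. It shows the "commit wait" rule described in Google's Spanner paper: the clock reports an interval rather than a single instant, and a writer holds its commit until the chosen timestamp is certain to be in the past.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: int  # true time is no earlier than this (ms)
    latest: int    # true time is no later than this (ms)


class SimulatedTrueTime:
    """Toy clock (hypothetical): time is known only to within +/- epsilon ms."""

    def __init__(self, epsilon_ms, start_ms=0):
        self.epsilon_ms = epsilon_ms
        self._now_ms = start_ms

    def now(self):
        return TTInterval(self._now_ms - self.epsilon_ms,
                          self._now_ms + self.epsilon_ms)

    def advance(self, ms):
        # Stands in for real time passing while the writer waits.
        self._now_ms += ms


def commit(tt, step_ms=1):
    """Pick a commit timestamp, then wait until it has definitely passed."""
    ts = tt.now().latest
    waited_ms = 0
    while tt.now().earliest <= ts:  # timestamp might still be in the future
        tt.advance(step_ms)
        waited_ms += step_ms
    return ts, waited_ms


tt = SimulatedTrueTime(epsilon_ms=4)
ts, waited_ms = commit(tt)
print(ts, waited_ms)
```

In this model the wait is roughly twice the clock uncertainty, which is why keeping that uncertainty small matters for latency.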
Hadoop, the open source distributed storage and processing framework Cutting co-founded, owes a debt to Google’s earlier MapReduce and Google File System papers. Synchronization is important to keeping Hadoop responsive and scalable across widely distributed data stores. Further study of Spanner’s implementation will likely be reflected in future releases of Hadoop, or in entirely new offerings inspired by Spanner.
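For readers unfamiliar with the MapReduce concept, here is a minimal single-process sketch of its three phases (map, shuffle, reduce) applied to word counting. It does not use Hadoop's API; the function names are hypothetical, and a real cluster would spread each phase across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(word_count(["big data", "Big clusters"]))
```

Because each map call sees only one line and each reduce call sees only one key, both phases can in principle run in parallel without coordination.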
Cutting does not anticipate the imminent death of relational database systems, or of the mature ecosystems around them. But he does envision that some aspects of large enterprise information requirements will demand “Google-like” flexibility or agility. Enterprises will need to determine how to keep running their OLTP systems while embracing Big Data.
Weaving YARN into Distributed Data Systems
Extending Hadoop to handle new and different types of processing loads is the focus of an emerging set of support tools. Applications such as machine learning and real-time event processing, for example in smart grid sensor networks, will place new demands on Hadoop clusters. Processing will need to be staged, queued and scheduled.
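What "queued and scheduled" means in practice can be sketched with a toy first-in, first-out scheduler. This is not how any particular Hadoop scheduler is implemented; the job names, the container counts and the round-based model are assumptions chosen to keep the example small.

```python
from collections import deque


def schedule(jobs, capacity):
    """Admit queued jobs FIFO in rounds; each round fits within capacity.

    jobs: list of (name, containers_needed) in arrival order (hypothetical).
    capacity: containers available to the cluster in each round.
    Returns a list of rounds, each a list of job names run together.
    """
    queue = deque(jobs)
    rounds = []
    while queue:
        free = capacity
        batch = []
        # Strict FIFO: stop at the first job that does not fit this round.
        while queue and queue[0][1] <= free:
            name, need = queue.popleft()
            free -= need
            batch.append(name)
        if not batch:
            raise ValueError(f"job {queue[0][0]!r} exceeds total capacity")
        rounds.append(batch)
    return rounds


jobs = [("etl", 4), ("ml-train", 6), ("sensor-stream", 3), ("report", 2)]
print(schedule(jobs, capacity=8))
```

Strict FIFO leaves capacity idle whenever a large job waits at the head of the queue. That trade-off is one reason production schedulers offer other policies, such as capacity or fair sharing.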
New support tools like Apache YARN, Hortonworks’ commercial YARN offering and Syncsort’s Ironcluster for Amazon EMR and Hadoop ETL represent an emerging ecosystem for Hadoop that anticipates real-world enterprise requirements.
Graph processing tools such as Apache Hama and Faunus build on the Hadoop Distributed File System (HDFS) but target different sets of applications. They may prove useful for enterprises engaged in large-scale research, engineering and genomics.
Hadoop-Sweetened Business Intelligence Suites
Business intelligence suites such as Tableau, QlikView, SAP HANA and BusinessObjects can access Hadoop stores through a variety of methods, including those provided by Syncsort or Cloudera. This relatively recent Hadoop flavoring allows architects to concoct heterogeneous recipes that combine online transaction processing (OLTP) and HDFS.
Back to School for Updated Lessons
Distributed processing has been heavily studied in academia since the 1970s. Faster pipes, Big Data and mega-core computing clusters can be seen as long-anticipated evolutionary changes. Sorting, compression, bitmaps, synchronization techniques and other approaches continue to be essential software building blocks. Seminal work such as C.A.R. Hoare’s Communicating Sequential Processes continues to influence distributed system designs.
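As a reminder of how simple some of these building blocks are, here is a minimal bitmap index sketch. It uses Python integers as bit sets, and the column data and function names are invented for illustration. Each distinct value gets one bitmap, and a query combining two conditions becomes a single bitwise operation.

```python
def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set if row i holds it."""
    index = {}
    for row, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def rows(bitmap):
    """Decode a bitmap back into a sorted list of row numbers."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical columns from a four-row customer table.
region = build_bitmap_index(["EU", "US", "EU", "APAC"])
tier = build_bitmap_index(["gold", "gold", "free", "gold"])

print(rows(region["EU"] & tier["gold"]))   # rows that are EU AND gold
print(rows(region["EU"] | region["APAC"])) # rows that are EU OR APAC
```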
Managers will want to take stock of the current skills of staff engineers. Updated education will likely need to encompass more than Hadoop fundamentals.
The Other Oracle
Cutting was not suggesting that Larry Ellison’s company can be ignored as irrelevant — only that Google is emerging as the other oracle in Silicon Valley, and we would do well to listen to the company’s predictions.
Mark Underwood writes about knowledge engineering and Big Data.