How Hadoop Fosters Starting Small with Big Data
Start small, scale up, rinse and repeat.
This was the advice given recently by MongoDB’s Matt Asay.
While the buzz about Big Data is resoundingly about its Bigness, sooner or later the hype morphs into a more recognizable hum. That hum is very likely the sound of specific projects launching with tools from the Hadoop ecosystem – but with modest data volumes.
Often the projects being launched in this fashion are tackling the Variety dimension of Big Data – more than Volume or Velocity. The reasons? Many organizations have unstructured data that has languished or gone unused. Also, unstructured data, used in conjunction with what Asay calls “incremental advances to existing use cases,” can have demonstrable benefits.
The more Hadoop is focused on smaller-scale deployments, the more it will ultimately get used for big deployments.
Yes, There Is an Echo
Asay’s perspective as a proponent of MongoDB is an important one, but he is not the first to make the case. Bill Franks offered the same message on a Harvard Business Review blog in 2012, foreshadowing Asay’s remarks. He suggested three general steps for rolling Big Data into an organization:
- Start small. Define “a few simple analytics” that do not require vast amounts of data or processing. He gives the example of a retailer who identifies which products each customer viewed after browsing an online catalog.
- Embrace a one-off sample. Skip the travails of setting up enterprise processes to capture all of the data all of the time. Franks suggests grabbing just a single month’s data for a single division. Keep the prototype manageable.
- Turn the analytics team loose. They can filter what they find to be useful, and will learn a lot about the data that will be useful to everyone involved. Sticking with the online retailer example, Franks suggests an analytics team might create test and control groups to test offers and use that data as part of the prototype.
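The “few simple analytics” Franks describes really can be simple. As a sketch, here is what his retailer example – tallying which products each customer viewed – might look like over a one-month sample. The event data and function name are hypothetical, purely for illustration:

```python
from collections import defaultdict

# Hypothetical one-month sample of clickstream events: (customer_id, product_id).
# In a real prototype these rows would come from a single division's web logs.
events = [
    ("c1", "p100"), ("c1", "p101"), ("c2", "p100"),
    ("c1", "p100"), ("c3", "p102"), ("c2", "p103"),
]

def products_viewed(events):
    """Return the distinct products each customer viewed -- a 'simple
    analytic' that needs neither vast data volumes nor heavy processing."""
    views = defaultdict(set)
    for customer, product in events:
        views[customer].add(product)
    return {customer: sorted(products) for customer, products in views.items()}

print(products_viewed(events))
```

The point is not the code but the scope: a prototype this small can be built, inspected, and iterated on in days, not quarters.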
Prototyping teams will find that they can be much more productive than when Hadoop was first introduced. Tools such as Syncsort’s Hadoop sort add-in and DMX-h ETL edition make life both easier and faster for developers.
Start small. (Hadoop itself started small: it was named after a toy elephant owned by the co-inventor’s son.)
More productivity enhancements are on the horizon. Apache Sqoop is a Top-Level Apache project that facilitates movement of bulk data between Hadoop and relational databases. For legacy applications, this could well be a deal-maker.
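To make that concrete, a typical Sqoop job is a single command that pulls one relational table into HDFS. The connection string, credentials, table, and target path below are placeholders, not a recommendation for any particular setup:

```shell
# Hypothetical example: import one table from a legacy MySQL database into HDFS.
# Adjust the connection string, credentials, and paths for your own cluster.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username analyst \
  --table orders \
  --target-dir /data/prototype/orders \
  --num-mappers 4
```

For a start-small prototype, one such import of a single month’s table may be all the data movement required.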
The Sqoop announcement in 2012 tentatively declared that Sqoop “is in the early stages of fulfilling requirements of data integration around Hadoop.” Since then, Sqoop has continued to evolve, and it should be no surprise that Syncsort developers have built and submitted a new open source contribution for Sqoop.
Go for the Gold Quick Win
The discussion so far has said little about the infrastructure needed to support a Hadoop initiative. Should it be cloud or on-premises? Cloudera or Amazon? Should analytics be fully integrated, or is there an investment in analytical or BI tools that needs to be leveraged – e.g., SPSS, SAS, R, QlikView, Spotfire, WebFocus, Tableau, or a newcomer like Dataiku?
These deliberations are important, maybe even critical for some organizations, but they also serve to shine a spotlight on potential sidetracks and rabbit holes. Perhaps it’s best simply to repeat Bill Franks’s simple directive:
Focus on some quick wins that prove the data’s value.
A small Big Data prototype may not win gold, but it will likely bring success.
Another way of stating the start-small approach is in risk-management terms: scaling up a prototype that has already proven itself is typically a lower-risk phase than betting big up front. If that rationale fails to persuade, talk to Accounting or Project Management – they’ll convince the doubters.