NoSQL is the New Black: How it Works with Hadoop
Sometimes instructors try to explain a concept to students by saying what it is not. The approach doesn’t always work. Consider the teaching that the color black is the absence of light. Black can be seen, it can be painted, it can be touched – it even appears on a color chart along with other colors. Children perceive black as similar to other colors, so trying to understand black as “NoColor” is hard for them.
The New Black
So it is with NoSQL, which is often understood to mean just “No SQL,” or “anything that isn’t a relational database.” As with the color black in optics, there’s a bit more to it than that. Some argue that “NoSQL” should be interpreted as “Not Only” SQL, because implementations may in fact incorporate both SQL and SQL-less solutions.
An image that sometimes accompanies discussions of NoSQL reads:
[Image: NoSQL = Not only SQL]
It is more appropriate than this one:
[Image: a sometimes-used NoSQL logo that is a misnomer]
Carlo Strozzi, who originally coined the term NoSQL, wishes it had been called “NoREL” – for “not relational.”
NoSQL and Hadoop: Either/Or?
With that bit of pedagogy out of the way, would-be big data architects want to understand whether they must choose between NoSQL and Hadoop. Should the two be kept separate, allocated to non-overlapping tasks, as some have argued? Or should they be used in tandem, with NoSQL processing in front of an HDFS store?
NoSQL databases come in a multitude of designs. They are sometimes grouped into categories such as column, document, key-value, graph and multi-model. The diversity of products within these categories, from graph databases like AllegroGraph, Neo4j and OWLIM to document stores like CouchDB and MongoDB, which store JSON-style documents, shows how many NoSQL design alternatives exist, and these are just a few.
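To make the categories a little more concrete, here is a minimal sketch of how one customer record might be shaped under four of these models. It uses plain Python structures only; the record, the field names and the `products_bought` helper are hypothetical, and none of this is the API of any product named above.

```python
import json

# Hypothetical record used only for illustration.
customer_id = "c42"

# Key-value: an opaque value looked up by a single key.
key_value = {
    f"customer:{customer_id}": json.dumps({"name": "Ada", "city": "Turin"})
}

# Document: a nested, self-describing record whose fields can be addressed.
document = {
    "_id": customer_id,
    "name": "Ada",
    "address": {"city": "Turin"},
    "orders": [{"sku": "A1", "qty": 2}],
}

# Column (wide-column): row key -> column family -> column -> value.
wide_column = {
    customer_id: {
        "profile": {"name": "Ada", "city": "Turin"},
        "orders": {"A1": 2},
    }
}

# Graph: nodes plus labeled edges between them.
nodes = {
    "c42": {"type": "customer", "name": "Ada"},
    "A1": {"type": "product"},
}
edges = [("c42", "BOUGHT", "A1")]


def products_bought(customer):
    """Follow BOUGHT edges outward from a customer node."""
    return [dst for src, label, dst in edges if src == customer and label == "BOUGHT"]
```

The same facts are present in every shape; what changes is which question is cheap to ask. A key lookup suits the first, a field query the second, a sparse column scan the third, and a relationship traversal the fourth.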
Some solutions may call for both NoSQL and Hadoop. Dale Kim, Director of Industry Solutions at MapR, emphasizes the similarities first. Both leverage commodity hardware, emphasize forms of parallelism, enable horizontal scaling and are happy ingesting giant log files, documents or video. There may be differences too, he says, representing what these two big data approaches do best:
“In a typical architecture, you have your NoSQL architecture for interactive data, and your Hadoop cluster for large-scale data processing and analytics. You might use NoSQL to manage user transactions data, sensor data, or customer profile data. You can then use Hadoop to analyze that data for outcomes like generating recommendations, performing predictive analytics, and detecting fraudulent activities.”
That said, Kim believes his own firm’s MapR technologies reconcile the bifurcation of these two promising technologies.
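The division of labor Kim describes, an interactive store in front and batch analysis behind, can be sketched in a few lines. This is an illustration under loud assumptions: a Python dict stands in for the NoSQL store, two plain functions stand in for a map and a reduce step of the kind a Hadoop job would run, and the function names, sample transactions and threshold are all hypothetical.

```python
from collections import defaultdict

# Stand-in for a NoSQL store: low-latency reads and writes by key.
transaction_store = {}


def record_transaction(txn_id, user, amount):
    """Interactive path: one small write per user action."""
    transaction_store[txn_id] = {"user": user, "amount": amount}


def map_phase(records):
    """Batch path, map step: emit (user, amount) pairs."""
    for rec in records:
        yield rec["user"], rec["amount"]


def reduce_phase(pairs):
    """Batch path, reduce step: total spend per user."""
    totals = defaultdict(float)
    for user, amount in pairs:
        totals[user] += amount
    return dict(totals)


def flag_large_spenders(totals, threshold):
    """Toy stand-in for an analytics outcome such as fraud screening."""
    return sorted(user for user, total in totals.items() if total > threshold)


# Interactive writes arrive one at a time.
record_transaction("t1", "ada", 40.0)
record_transaction("t2", "bob", 15.0)
record_transaction("t3", "ada", 75.0)

# Later, a batch job scans everything that was stored.
totals = reduce_phase(map_phase(transaction_store.values()))
flagged = flag_large_spenders(totals, threshold=100.0)
print(flagged)
```

The point of the sketch is the shape of the pipeline: many small keyed writes on one side, one full scan with aggregation on the other. A real deployment would replace the dict with a database client and the two functions with a distributed job.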
Heroes: The Unsung and the Still Singing
Whether NoSQL continues to be the “unsung hero” that Doug Henschen identified two years ago (refer to his use cases from MetLife and Constant Contact) is still unclear. It seems likely that big data novices will investigate NoSQL and Hadoop both separately and in parallel. Whether they see Hadoop HDFS at the tail end of a processing pipeline may depend on whether they are color-blind to the unique capabilities of different NoSQL implementations.
Still very much in the mix are SQL databases, which are themselves scaling up to address big data needs. SQL-like queries are also being supported within, or on top of, both NoSQL and Hadoop implementations.
DMX-h Plus NoSQL Equals ETL
Ever a player in this still-evolving space, Syncsort facilitates adoption of NoSQL big data processing through its high-performance DMX-h solution.
It seems that a big data implementation can be a multicolored rainbow affair.