Expert Interview with Christian Prokopp from Semantikoz.com
Your blog on Semantikoz is titled “Big Data Science and Cloud Computing.” “Data Science” is not original with you, but where do you get the “science” in the expression? The methods of experimental science, for example, do not tend to enter the conversation.
I started and named the blog at the time of my PhD and later was hired as a data scientist but in practice worked as a big data architect. The blog reflects the change I underwent from science to industry. Many businesses ask data science questions but in fact have no big data architecture to answer them. So businesses hire PhDs who can use machine learning, R, Python/Java/Scala, and maybe Hadoop, and hope they provide novel insights and products that dazzle consumers and dismay the competition.
Often, the reality is that a lot of work has to happen before science becomes relevant. Initially, a big data strategy needs to be defined and an architecture developed, deployed, and maintained to support it. Applying science to the resulting data sets is only the last step in this workflow, and, importantly, improvements early on, in data acquisition, cleaning, and processing, usually have a higher impact than focusing on complex algorithms at the end.
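That ordering can be made concrete with a minimal Python sketch of an acquire, clean, analyze workflow; the record fields and cleaning rules here are hypothetical illustrations, not anything from a real pipeline:

```python
# Minimal sketch of an acquire -> clean -> analyze workflow.
# Field names and cleaning rules are hypothetical illustrations.

def acquire():
    # Stand-in for ingesting raw records from logs, APIs, or files.
    return [
        {"user": "a", "spend": "10.5"},
        {"user": "b", "spend": ""},      # missing value
        {"user": "a", "spend": "10.5"},  # duplicate
        {"user": "c", "spend": "7.25"},
    ]

def clean(records):
    # Usually most of the effort: deduplicate, drop invalid rows,
    # and convert types before any analysis can happen.
    seen, out = set(), []
    for r in records:
        key = (r["user"], r["spend"])
        if key in seen or not r["spend"]:
            continue
        seen.add(key)
        out.append({"user": r["user"], "spend": float(r["spend"])})
    return out

def analyze(records):
    # The "science" step comes last and is simple by comparison.
    return sum(r["spend"] for r in records) / len(records)

print(analyze(clean(acquire())))  # mean spend over the cleaned data
```

The point of the sketch is the proportions: two of the three steps, and most of the lines, deal with getting data into a usable state before any analysis runs.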
Why did you mention Altiscale in this context?
Altiscale provides what they call Hadoop as a service, which I would group into big data as a service, a category into which others like Amazon Web Services or Qubole would also fall. Many businesses want to do big data but lack the in-house experience and the scale to justify investments. It makes sense for them to employ services to explore the topic, shift risks from capital expenditures to operational ones, and potentially bridge some of the architecture gap with third-party help. I find this space intriguing and am interested in its long-term evolution. The competition between bare-metal performance solutions and ones employing exchangeable, utility-like cloud computing resources is particularly fascinating.
Around the time that this question is being composed, Google announced that Google Apps Unlimited would include unlimited storage. What implications do you think AWS, Google and Azure initiatives to press toward higher storage limits will have on business intelligence? Which of the big data V’s will be most impacted?
All three. Immediately, the volume of data stored will increase, but it also impacts the variety of data that is cost-effective to store and the velocity, since throughput has to keep up. The Lambda Architecture, for example, proposes to store (near) raw data and process it in parallel in offline and online fashion. Inexpensive storage means that you can keep data in its most unrefined state for longer periods and add more data sources, but you will also need to scale your processing capabilities. Business intelligence, therefore, should benefit from more data, more granularity, and longer time windows, and hopefully this leads to better insights and predictions and subsequently improved decisions.
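The offline/online split the Lambda Architecture proposes can be sketched as an in-memory toy in Python. In a real deployment the master dataset would live in something like HDFS or S3, the batch layer would be a MapReduce or Spark job, and the speed layer a stream processor such as Storm; the names and events below are illustrative only:

```python
# Toy in-memory sketch of the Lambda Architecture's three layers:
# an immutable master dataset, a batch layer that recomputes views
# from scratch, and a speed layer that covers events the last batch
# run has not seen yet. All names and events are illustrative.
from collections import Counter

master_dataset = []        # immutable store of (near) raw events
batch_view = Counter()     # recomputed wholesale by the batch layer
realtime_view = Counter()  # incrementally updated by the speed layer

def ingest(event):
    # New data goes to both the master dataset and the speed layer.
    master_dataset.append(event)
    realtime_view[event] += 1

def run_batch():
    # Batch layer: recompute the view over the entire master dataset.
    global batch_view, realtime_view
    batch_view = Counter(master_dataset)
    realtime_view = Counter()  # the batch has absorbed everything so far

def query(event):
    # Serving layer: merge the batch and real-time views.
    return batch_view[event] + realtime_view[event]

for e in ["click", "click", "view"]:
    ingest(e)
run_batch()
ingest("click")  # arrives after the batch run
print(query("click"))  # 2 from the batch view + 1 from the speed layer
```

Cheap storage matters here precisely because the master dataset is append-only and unrefined: the batch layer can always recompute richer views later, at the cost of more processing.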
What technologies might down the road share the stage with MapReduce and Hadoop?
There are a few important developments, like Spark, Storm, and YARN, that will drive change in the coming years. Spark and Storm challenge the offline-processing paradigm that has dominated Hadoop. Hadoop itself will continue to play a significant role thanks to YARN, an independent cluster resource management layer that frees Hadoop from MapReduce's limitations. It effectively makes Hadoop the backbone distributed storage and compute platform, enabling organizations to store and process petabytes of data inexpensively. Together with projects like Slider, which supports long-running applications on YARN, Hadoop becomes a distributed multipurpose compute platform.
Your blog post “Big Data Transforms Online Education” presents a mostly rosy view of the role of Big Data in education. To old-timers, this is reminiscent of the enthusiasm first shown for computer-assisted instruction in the 1960s. Yet there is a backlash underway in the U.S. against concerted data-based initiatives like Common Core and the nonprofit InBloom. How do you interpret these trends in light of the meritocracy envisioned in your post?
Technology advances and we should manage it, i.e. we have to have a discussion and then reach consensus on what we want to achieve, what is reasonable, and what the rights of individuals are. This is very similar to other social issues, from health care to privacy laws. I think we are at the beginning of this discussion and technological evolution. We can envision personalized learning and aids to teachers that will require a change in how we teach and measure success in learning. Massive open online courses in higher education will lead the way, since they have fewer barriers, e.g. the market may accept novel degree-like certificates before regulations catch up. The experience from this development will transform early education too. It will take a long time, though, since traditions, emotions, and powerful stakeholders are involved.
In the end, we have to strike a balance between the individual’s right to self-fulfillment, e.g. having a chance to compete in something (s)he enjoys but may not be predestined for, and the optimal outcome. We already do this today, i.e. if you are terrible at your job you will get fired, even if it is your dream. On the positive side, new learning approaches may help you become better at something you love but were not predestined to do, which may help you get and keep your dream job.
You remarked on a dramatic performance difference between a 1,636-node Hadoop cluster and a Mac Mini. What role do you see for graph databases in a big data future?
Graph problems hide in many big data challenges, recommendations for example. However, the real world and its data are messy. Distinct, reliable relationships are rarely part of your original data; rather, they are an outcome of refinement and transformation, often employing machine learning to extract them. In fact, as an essential part of their big data strategy, many organizations try to extend their small core set of “clean” data with untidy, unstructured third-party data to gain more insight. The default architecture emerging around this is the data lake, with Hadoop at its center and various inputs and outputs. Graph databases can already be very useful in this architecture for storing highly refined data. In the long term, their broader success will depend on whether their technological evolution can deliver seamless, unlimited linear horizontal scalability to compete with NoSQL stores that offer it without the graph-specific benefits, i.e. with graph reasoning pushed to the application or processing layer.
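How a graph problem hides inside a recommendation task can be shown with a small Python sketch: a bipartite user-item graph where a two-hop walk (item, its buyers, their other items) yields “people who bought X also bought Y” candidates. The purchase data and item names are entirely hypothetical:

```python
# Toy sketch of a graph-shaped recommendation: a bipartite user-item
# graph, traversed two hops (item -> buyers -> their other items) to
# rank co-purchased items. The purchase data is hypothetical.
from collections import Counter

purchases = {
    "alice": {"hadoop-book", "spark-book"},
    "bob":   {"hadoop-book", "graph-book"},
    "carol": {"hadoop-book", "spark-book", "graph-book"},
}

# Refinement step: invert the edges to get item -> buyers.
buyers = {}
for user, items in purchases.items():
    for item in items:
        buyers.setdefault(item, set()).add(user)

def recommend(item, limit=2):
    # Two-hop traversal: co-purchased items ranked by overlap count,
    # with ties broken alphabetically for deterministic output.
    scores = Counter()
    for user in buyers.get(item, ()):
        for other in purchases[user] - {item}:
            scores[other] += 1
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:limit]]

print(recommend("hadoop-book"))  # ['graph-book', 'spark-book']
```

Note that the graph reasoning here lives entirely in the application layer over a plain key-value shape, which is exactly the trade-off mentioned above: a dedicated graph database moves that traversal into the store itself.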
You wrote about the Lambda Architecture. How would current thinking about real-time big data differ if it were the dominant model?
Real-time big data processing would be commonplace, not something many still consider difficult. No one thinks of MapReduce as a challenge anymore, and real time should be no different. The Lambda Architecture is becoming popular since the natural evolution of many big data projects is integrating the volume, velocity, and variety of data.
Is training for new software developers adequately preparing them with models that support the Internet of Things?
The Internet of Things requires a broad set of skills, from building software for embedded devices to distributed systems. A lot of the existing training and models are relevant, though a holistic approach may require some rebalancing of traditional software engineering courses.
What standards work are you watching most closely, if any?
I don’t follow particular standards. I am more interested in patterns and technologies emerging around Big Data. The ideas around the data lake and data OS as well as patterns like the Lambda Architecture or Polyglot Persistence are very exciting.