It was only a matter of minutes after the discovery that Malaysia Airlines flight MH370 was missing that questions began to be asked about the data. How could a Boeing 777 jet airliner disappear without a digital trace? What about the data the airplane collected? What about the tracking data? The transponders? The satellite “pings”? What can the cell phones’ GPS data tell us?

It was assumed that the quickest path to answering questions about the mysterious flight disappearance was to decipher the big data digital stream that the plane must have created.

Read my complete blog on The Wharton IGEL Blog.

Follow Gary Survis on Google+


Dr. Vincent Granville is Co-Founder of Data Science Central and author of the upcoming text, Developing Analytical Talent: Becoming a Data Scientist.

Q. On the Analytic Bridge and Data Science sites, there is comparatively sparse reference to ETL. Does this reflect slow migration to platforms that support big data analytics when compared to greenfield big data initiatives?

I think ETL is not considered analytics by traditional analytics practitioners (statisticians). It is regarded as data plumbing and software engineering/architecture, or computer science at best. However, it is a critical piece of big data with its own analytics challenges (yield optimization, server up-time optimization, selection of metrics, load balancing, etc.). At Data Science Central, we are getting more involved with ETL, especially in NoSQL environments (Hadoop-style frameworks). I am also currently working on a paper about new, massive data compression algorithms for row/column datasets, as well as hierarchical summary databases.

Q. Some observers believe there is a heightened potential for misleading or erroneous inference-making with Big Data. Do you agree?

Yes. When you are looking for millions of patterns in millions of data buckets, you are bound to find spurious correlations. Indeed, what you think are strong signals (using traditional pattern recognition algorithms) are likely to be plain noise. Still, big data contains more information than smaller data; you need the right methodology to identify the real signals and insights. The issue is discussed in my article, “The Curse of Big Data.”
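The effect is easy to demonstrate. The following sketch (with made-up, purely random data) correlates a thousand noise series against a noise target; the strongest correlation found looks like a real signal even though none exists by construction:

```python
import numpy as np

# Pure noise: 1,000 candidate "signals" of 100 observations each,
# plus one target series. Nothing is related by construction.
rng = np.random.default_rng(42)
candidates = rng.normal(size=(1000, 100))
target = rng.normal(size=100)

# Correlate every candidate with the target and keep the strongest.
corrs = np.array([np.corrcoef(c, target)[0, 1] for c in candidates])
best = np.abs(corrs).max()

# With this many comparisons, the winning correlation looks "strong",
# yet it is guaranteed to be a fluke of the search itself.
print(f"strongest spurious correlation: {best:.2f}")
```

The more buckets you search, the larger the best spurious correlation becomes, which is exactly why naive pattern mining at big-data scale misleads.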

Q. Where do you think we are with Big Data along the Gartner hype cycle? (Feel free to amend their construct as you see fit.)

I think it depends on the user. For the 1% consuming 99% of all data, we are close to the “plateau of productivity”. For the 99% consuming 1% of all data, we are barely beyond the “trough of disillusionment”.

Q. What standards – formal or de facto – are you watching most closely?

I am looking at what universities and organizations offer in terms of training and certifications. A number of organizations, including ourselves, offer non-standard training and certification: online, on-demand, low-cost, non-academic, practical. It will be interesting to see how this training ecosystem evolves.

Q. Which computing disciplines that existed 5-10 years ago do you feel will be best able to transition to today’s data science frameworks?

I don’t think a particular discipline is better positioned. First, boundaries are fuzzy – machine learning is statistics (clustering), data mining and AI are machine learning, data engineering is computer science. Then, data science is a hybrid, cross-discipline science. It was created precisely as a reaction against silos – computer science, business analytics, statistics, operations research and so on.

Q. In the interview on your web site with Sir David Cox, he says, “My intrinsic feeling is that more fundamental progress is more likely to be made by very focused, relatively small scale, intensive investigations than collecting millions of bits of information on millions of people, for example. It’s possible to collect such large data now, but it depends on the quality, which may be very high or not, and if it is not, what do you do about it?” What is Cox saying to communities like the American Society for Quality and others who would like to positively influence Big Data quality?

I just reposted the interview; I am not sure who interviewed David Cox in the original article. I have two reactions to David’s comment. First, many organizations faced with big data do a poor job of extracting insights with good ROI. Maybe they are not analytically mature enough to deal with big data. In this sense, David is right. But I believe the future is in developing methodology for very big data, and it does not need to be expensive – see my article “Big Data is Cheap and Easy.” Big Data is very different from traditional data. Mathematicians found that some number theory conjectures, true for billions of numbers, did not hold for all numbers once they had enough theoretical knowledge or computing power to run really big tests; I expect data practitioners to make similar discoveries about big data.

Q. Are what Cox calls “small, carefully planned investigations” orthogonal to self-service tools like Tableau or QlikView, which can operate in a hypothesis vacuum?

I’m sure Cox is talking about experimental design, a branch of statistics. I believe model-free, data-driven, assumption-free testing has a lot of potential. Much of the criticism of traditional statistics concerns testing one hypothesis against another, while in fact you would like to find out which ones, out of 20 hypotheses, seem to provide the most reliable causal explanations.
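When screening 20 hypotheses at once, the multiple-comparisons problem from the earlier answer returns. One standard (though not the only) safeguard is the Bonferroni correction; the p-values below are hypothetical, purely for illustration:

```python
# Hypothetical p-values from testing 20 candidate explanations at once.
p_values = [0.001, 0.20, 0.04, 0.47, 0.93, 0.08, 0.002, 0.31, 0.66, 0.12,
            0.55, 0.03, 0.78, 0.09, 0.41, 0.25, 0.60, 0.015, 0.88, 0.34]

alpha = 0.05  # desired overall false-positive rate

# Bonferroni: compare each p-value against alpha / number_of_tests,
# so the chance of even one spurious "discovery" stays near alpha.
threshold = alpha / len(p_values)
survivors = [i for i, p in enumerate(p_values) if p < threshold]
print(survivors)
```

Testing each hypothesis naively at alpha = 0.05 would flag five of these twenty; after correction only the two strongest survive, which is the kind of discipline that makes many-hypothesis screening trustworthy.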

Q. What changes, if any, would you like to see in the data science curriculum to improve data visualization?

Data videos as described in my most recent book (Developing Analytical Talent: Becoming a Data Scientist, Wiley April 2014) and the ability to put more dimensions in what is, at the end of the day, a two-dimensional chart. You can do this by playing with a full spectrum of symbols, dot sizes, colors and palettes. But it comes with a price: training decision executives on how to quickly interpret more sophisticated graphs. In any case, simplicity is paramount in any visualization, or your end users will eventually misinterpret or stop looking at your charts.

Q. Those in the non-academic, media side of the data business extol the virtues of narratives and storytelling. They’re not always clear, though, about what data to use and when. How should they proceed in curating data-driven, data-annotated stories?

I like good infographics. Cartoons might also have the potential to quickly convey an important concept or finding, though as of today, I know of no one communicating insights via cartoons.

Q. The forums you have fostered have been around for a while now. How has the population in the forums changed over time?

The population has become more purely data-oriented over the last two years, with a shift from pure analytics to computer science, data engineering and architecture. In some sense, it is now more mainstream, even attracting IT people.

Q. What trends are you watching most keenly?

I am watching which technologies are becoming more popular. While everybody was, and still is, talking about NoSQL, people have realized that Map-Reduce can’t efficiently solve everything, and that there are solutions such as graph databases (technically also NoSQL) to address spatial problems, weather prediction for instance. I also see a number of statisticians and business analysts who either want to be more involved in data science, or are aggressively protecting their turf against any kind of change, positive or negative. Some practitioners are opposed to a hybrid, cross-discipline role such as data scientist, and claim that well-defined roles (I call them silos) are better. I believe that start-ups are more open to data science and automation as a way to compete against bigger companies – for instance, creating a crowdsourcing application that predicts the price of any standard medical procedure in any hospital in the world, while big healthcare in the US has been unable to provide any kind of pricing to patients asking about the cost of a procedure.

Q. Do you have a few favorite software tools that you feel are indispensable?

I have been using open source for a long time, including many UNIX tools, and then Perl for many text-processing-intensive applications (prototyping and development). Python (check out the pandas library for data analysis) and R (which I like for visualizations) are now more popular and integrate better in a team environment. I used Syncsort to perform large sorts on credit card transactions when working at Visa. PuTTY (Telnet/SSH), FileZilla (FTP), Snip (for screenshots), web crawlers (I write mine in Perl), Cygwin (a UNIX-like environment for Windows), Excel, and knowledge of SQL, HTML, XML, JavaScript, a programming language such as C, UNIX, and R are indispensable.
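As a flavor of the pandas work mentioned above, here is a minimal sketch of the kind of sort-and-aggregate plumbing it makes concise. The transaction records are entirely made up for illustration:

```python
import pandas as pd

# Hypothetical transaction records, in the spirit of the large
# credit card aggregations mentioned above.
df = pd.DataFrame({
    "merchant": ["A", "B", "A", "C", "B", "A"],
    "amount":   [10.0, 5.0, 7.5, 12.0, 3.0, 2.5],
})

# Group by merchant, total the amounts, and rank largest first --
# one line for what would be a sort/merge pipeline in classic tooling.
totals = df.groupby("merchant")["amount"].sum().sort_values(ascending=False)
print(totals)
```

The same group-aggregate-sort pattern scales from a six-row toy frame to millions of transactions without changing the code.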

Mark Underwood writes about knowledge engineering, Big Data security and privacy @knowlengr LinkedIn.


If you’ve ever done some serious shopping for your next high-tech gadget, chances are you’ve seen one of those “un-boxing” or “haul” videos on YouTube. I have to confess, at first I thought they were kind of silly and self-indulgent, until I found they had a practical purpose. Recently, I was on the fence about an item I wanted to buy. Out of the blue, I remembered the videos on YouTube, and sure enough there was a review of the item I was thinking of buying. Actually, there were several reviews. I was pleasantly surprised that the video was thorough and practical, as if it were tailored to my interests. Afterwards, I couldn’t wait to get to the store and buy that item.

I can admit I was wrong, and now I am inspired to un-box Syncsort DMX-h. From a practical standpoint, people want to see what they are getting and the things they can do with it. You’ve heard all the chatter about Syncsort and its extremely fast, yet easy-to-use Hadoop ETL engine. The good news is that now you can have a better look at some of the most common use cases you can deploy in Hadoop with DMX-h. So, let me introduce our new collection of videos. Each video targets a very specific use case – short and sweet! In this, our initial release, you will find the following videos:

If a picture is worth a thousand words, then video is priceless – especially when it comes to making your life a little easier. Sure, Hadoop is a great platform for processing large amounts of data, and of course writing MapReduce code might be fun for some, but why spend time re-inventing the wheel – sorting, aggregating, joining data – when you could spend your time doing more – predictive analytics, complex algorithms and so on – with Hadoop? So, hopefully, you will find this selection of tutorials brief, and that they demonstrate practical business uses. Who knows? You may want to show off to your friends and upload your own Syncsort “haul”.


All companies are facing huge issues from the explosion of Big Data, which is breaking traditional architectures. Yet while this disruption presents a great opportunity – to gain new insights that are transformative to the business – the existing architecture and way of doing things has to change; more of the same has already proven not to be the right answer.

To learn more, read my guest Blog on the Cloudera Blog site.

Find out more about the Cloudera and Syncsort partnership.

Follow Steven Totman on Google+