
Expert Interview with Vincent Granville

Dr. Vincent Granville is Co-Founder of Data Science Central and author of the upcoming text, Developing Analytical Talent: Becoming a Data Scientist.

Q. On the AnalyticBridge and Data Science Central sites, there is comparatively sparse reference to ETL. Does this reflect slow migration to platforms that support big data analytics when compared to greenfield big data initiatives?

I think ETL is not considered analytics by traditional analytics practitioners (statisticians). It is regarded as data plumbing and software engineering/architecture, or computer science at best. However, it is a critical piece of big data with its own analytics challenges (yield optimization, server uptime optimization, selection of metrics, load balancing, etc.). At Data Science Central, we are getting more involved with ETL, especially in NoSQL environments (Hadoop-style frameworks). I am also currently working on a paper about new, massive data compression algorithms for row/column datasets, as well as hierarchical summary databases.

Q. Some observers believe there is a heightened potential for misleading or erroneous inference-making with Big Data. Do you agree?

Yes. When you are looking for millions of patterns in millions of data buckets, you are bound to find spurious correlations. Indeed, what you think are strong signals (detected with traditional pattern recognition algorithms) are likely to be plain noise. Still, big data contains more information than smaller data; you need the right methodology to identify the real signals and insights. The issue is discussed in my article, “The Curse of Big Data.”
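The effect Granville describes can be demonstrated with a short simulation (an illustrative sketch, not taken from the interview): the more pure-noise series you compare, the stronger the largest "signal" among them appears.

```python
import numpy as np

rng = np.random.default_rng(0)

def max_abs_correlation(n_series, n_points, rng):
    """Largest absolute pairwise correlation among n_series pure-noise series."""
    data = rng.standard_normal((n_series, n_points))
    corr = np.corrcoef(data)
    np.fill_diagonal(corr, 0.0)  # ignore trivial self-correlations
    return np.abs(corr).max()

# Every series is random noise, yet the maximum correlation found
# grows as we search across more and more pairs.
few = max_abs_correlation(10, 50, rng)
many = max_abs_correlation(500, 50, rng)
print(f"max |r| among 10 series:  {few:.2f}")
print(f"max |r| among 500 series: {many:.2f}")
```

With 500 series there are over 100,000 pairs to search, so a "strong" correlation emerges by chance alone, which is exactly why big data demands methodology that accounts for the size of the search space.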

Q. Where do you think we are with Big Data along the Gartner hype cycle? (Feel free to amend their construct as you see fit.)

I think it depends on the user. For the 1% consuming 99% of all data, we are close to the “plateau of productivity”. For the 99% consuming 1% of all data, we are barely beyond the “trough of disillusionment”.

Q. What standards – formal or de facto – are you watching most closely?

I am looking at what universities and organizations offer in terms of training and certifications. A number of organizations, including our own, offer non-standard training and certification: online, on-demand, low-cost, non-academic, and practical. It will be interesting to see how this training ecosystem evolves.

Q. Which computing disciplines that existed 5-10 years ago do you feel will be best able to transition to today’s data science frameworks?

I don’t think a particular discipline is better positioned. First, boundaries are fuzzy – machine learning is statistics (clustering), data mining and AI are machine learning, data engineering is computer science. Then, data science is a hybrid, cross-discipline science. It was created precisely as a reaction against silos – computer science, business analytics, statistics, operations research and so on.

Q. In the interview on your website with Sir David Cox, he says, “My intrinsic feeling is that more fundamental progress is more likely to be made by very focused, relatively small scale, intensive investigations than collecting millions of bits of information on millions of people, for example. It’s possible to collect such large data now, but it depends on the quality, which may be very high or not, and if it is not, what do you do about it?” What is Cox saying to communities like the American Society for Quality and others who would like to positively influence Big Data quality?

I just reposted the interview; I am not sure who interviewed David Cox in the original article. I have two reactions to David’s comment. First, many organizations faced with big data do a poor job of extracting insights with good ROI. Maybe they are not analytically mature enough to deal with big data; in this sense, David is right. But I believe the future is in developing methodology for very big data. It does not need to be expensive; see my article “Big Data is Cheap and Easy.” Big data is very different from traditional data. Mathematicians have found that some number theory conjectures, verified for billions of numbers, turned out to be false once they acquired enough theoretical knowledge or computing power to run really big tests; likewise, I expect data practitioners to make similar discoveries about big data.

Q. Are what Cox calls “small, carefully planned investigations” orthogonal to self-service tools like Tableau or QlikView, which can operate in a hypothesis vacuum?

I’m sure Cox is talking about experimental design, a branch of statistics. I believe model-free, data-driven, assumption-free testing has a lot of potential. Much of the criticism of traditional statistics is about testing one hypothesis against another, while in fact you would like to find out which of, say, 20 hypotheses seem to provide the most reliable causal explanations.
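Screening 20 hypotheses at once raises the multiple-comparisons problem implicit in this answer. A minimal simulation (my illustration, not Granville's) shows why naive per-hypothesis testing at p < 0.05 misleads, and how a Bonferroni-style correction guards against it:

```python
import numpy as np

rng = np.random.default_rng(42)

n_hypotheses = 20
n_trials = 10_000
alpha = 0.05

# Under the null hypothesis, p-values are uniform on [0, 1].
# Each row simulates one study that screens 20 true-null hypotheses.
p_values = rng.uniform(size=(n_trials, n_hypotheses))

# Fraction of studies where at least one hypothesis looks "significant"
# (the family-wise error rate, FWER).
naive_fwer = np.mean((p_values < alpha).any(axis=1))

# Bonferroni correction: test each hypothesis at alpha / n_hypotheses.
bonferroni_fwer = np.mean((p_values < alpha / n_hypotheses).any(axis=1))

print(f"naive FWER:      {naive_fwer:.2f}")       # near 1 - 0.95**20, about 0.64
print(f"Bonferroni FWER: {bonferroni_fwer:.2f}")  # held near alpha = 0.05
```

Without the correction, roughly two of every three all-null screens would still "discover" a winning hypothesis, which is one reason picking the best of 20 explanations needs more care than a single pairwise test.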

Q. What changes, if any, would you like to see in the data science curriculum to improve data visualization?

Data videos, as described in my most recent book (Developing Analytical Talent: Becoming a Data Scientist, Wiley, April 2014), and the ability to put more dimensions in what is, at the end of the day, a two-dimensional chart. You can do this by playing with a full spectrum of symbols, dot sizes, colors and palettes. But it comes at a price: training decision executives to quickly interpret more sophisticated graphs. In any case, simplicity is paramount in any visualization, or your end users will eventually misinterpret your charts or stop looking at them.

Q. Those in the non-academic, media side of the data business extol the virtues of narratives and storytelling. They’re not always clear, though, about what data to use and when. How should they proceed in curating data-driven, data-annotated stories?

I like good infographics. Cartoons might also have potential to quickly convey an important concept or finding, though as of today, I know of no one communicating insights via cartoons.

Q. The forums you have fostered have been around a while now. How has the population in the forums changed over time?

The population has become more purely data-oriented over the last two years, with a shift from pure analytics to computer science, data engineering and architecture. In some sense, it is now more mainstream, even attracting IT people.

Q. What trends are you watching most keenly?

I am looking at the types of technologies that are becoming more popular. While everybody was, and still is, talking about NoSQL, people realize that MapReduce can’t efficiently solve everything, and that there are solutions such as graph databases (technically also NoSQL) to address spatial problems such as weather prediction. I also see a number of statisticians and business analysts who either want to be more involved in data science, or are aggressively protecting their turf against any kind of change, positive or negative. Some practitioners are opposed to a hybrid, cross-discipline role such as data scientist, and claim that well-defined roles (I call them silos) are better. I believe start-ups are more open to data science and automation as a way to compete against bigger companies: for instance, creating a crowdsourced application that predicts the price of any standard medical procedure in any hospital in the world, while big healthcare companies in the US have been unable to provide any kind of pricing to patients asking about the cost of a procedure.

Q. Do you have a few favorite software tools that you feel are indispensable?

I have been using open source for a long time: many UNIX tools, then Perl for text-processing-intensive applications (prototyping and development). Python (check out the pandas library for data analysis) and R (which I like for visualizations) are now more popular and integrate better into a team environment. I have used Syncsort to perform large sorts on credit card transactions while working at Visa. PuTTY (Telnet/SSH), FileZilla (FTP), Snip (for screenshots), web crawlers (I write mine in Perl), the Cygwin environment (a kind of UNIX for Windows), Excel, and knowledge of SQL, HTML, XML, JavaScript, a programming language such as C, UNIX, and R are indispensable.

Mark Underwood writes about knowledge engineering, Big Data security and privacy @knowlengr LinkedIn.
