
Expert Interview with Florian Douetteau

Florian Douetteau is the founder of Dataiku, a Paris-based data science firm.

Q. The offerings at Dataiku are grouped into Visual Data Enrichment, Guided Machine Learning, Shared Insights and Data-Driven Computation. Which of these generates the most interest from visitors? Which of these is used most by long-term customers?

Our Data Science Studio provides an integrated experience, where you perform visualization, cleansing, enrichment, and predictive analytics on a dataset all at the same time. Today, creating unique applications out of data requires this combination of capabilities, and that combination is what our customers are looking for.

Q. The Dataiku Studio provides some ETL-like functionality to users. What other ETL needs do your customers have before they can leverage your resources?

Our customers use all kinds of ETL tools, such as Talend, Syncsort and Informatica. When we talk about data, it’s about moving and transforming data, one way or another. I think we will see lots of special-purpose tools with ETL capabilities in the future. Our focus is to provide efficient support for predictive-oriented applications.

Q. The Dataiku site describes the Data Science Studio. It’s said to feature “a smart resource management system and built-in data quality checks.” Can you describe the data quality checks performed and how the need emerged as the Studio product evolved?

When running Big Data workflows over the long term, you definitely need to check for evolving patterns: the number of unique visitors, the number of unique products, ratios, and so on. When working with behavioral data, such as website logs or mobile app data, unexpected events are bound to arise, and this must be taken into account in workflow management. We specifically designed our Studio so that business users can apply rules and get a view of the computed data at every step, checking business ratios the way a technical user would.
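A check of the kind described can be sketched in a few lines. The column names (`visitor_id`, `product_id`), metric names, and drift threshold below are illustrative assumptions, not Dataiku’s actual implementation:

```python
# Hypothetical sketch of per-step data quality checks on behavioral data.
# Column names and the drift threshold are illustrative assumptions.
import pandas as pd

def quality_metrics(events: pd.DataFrame) -> dict:
    """Compute the business metrics a user might watch over time."""
    visitors = events["visitor_id"].nunique()
    return {
        "unique_visitors": visitors,
        "unique_products": events["product_id"].nunique(),
        # Events per visitor: a simple ratio that surfaces unexpected shifts.
        "events_per_visitor": len(events) / max(visitors, 1),
    }

def drift_alerts(current: dict, baseline: dict, max_drift: float = 0.5) -> list:
    """Flag metrics whose relative change from the baseline exceeds max_drift."""
    return [
        name
        for name, value in current.items()
        if name in baseline
        and baseline[name]
        and abs(value - baseline[name]) / baseline[name] > max_drift
    ]
```

A workflow step would compute `quality_metrics` on each new batch of data and compare the result against the previous run with `drift_alerts` before letting downstream steps proceed.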

Q. The Dataiku site exhorts prospective customers to “Do Something Big.” To what extent is the “bigness” a function of Big Data versus the use of predictive analytics, machine learning and other analytical resources you provide?

Our customers’ ambition is usually not about size. It’s more about building something unique, new applications that provide value. We believe our customers are pioneers of a new wave to come. Lots of uncharted territory remains to be explored, where you apply machine learning to data it has never been applied to before. Lots of unique applications remain to be built; that’s what uniqueness is about.

Q. What challenges arise from having to support Pig, Hive, SQL and Python for scripting, and Hadoop, MongoDB, relational and cloud storage (S3, CS) options within the Data Science Studio?

We live in a very interesting technological universe where lots of tools and options for working with data are available for free, thanks to the contributions of the open source community. Today, the challenge is more about applying the right tool in the right place, and then tackling the complexity of having multiple storage systems and languages. You might prefer Pig for data munging, Hive for computation, Python for advanced modeling, ElasticSearch for search and Hadoop for large-scale storage. We see our Studio as a Swiss army knife where the user can quickly access all of those tools and technologies.
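The “right tool in the right place” idea can be sketched as a small dispatch layer. The routing table and backend names below are illustrative assumptions mirroring the pairings mentioned above, not Dataiku’s architecture:

```python
# Minimal sketch of routing workflow steps to the tool suited for each kind
# of work. The routing table is an illustrative assumption.
from typing import Dict

ROUTES: Dict[str, str] = {
    "munging": "pig",           # ad-hoc cleanup and reshaping
    "aggregation": "hive",      # SQL-like batch computation
    "modeling": "python",       # advanced statistical modeling
    "search": "elasticsearch",  # full-text and faceted search
    "storage": "hadoop",        # large-scale raw storage
}

def run_with(backend: str, step: str) -> str:
    # Stand-in for actually submitting the step to the named engine.
    return f"{step} -> {backend}"

def execute(step: str, kind: str) -> str:
    """Dispatch a workflow step to the backend registered for its kind."""
    backend = ROUTES.get(kind, "python")  # default to general-purpose Python
    return run_with(backend, step)
```

The point of such a layer is that the user describes *what* each step does, and the studio decides *where* it runs.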

Q. Where does R fit into your thinking?

We do actually support R in our latest version.

Q. Does Dataiku complement or compete with the visualization and self-service capabilities offered by BI tools such as Tableau, QlikView, Spotfire, Cognos, etc.?

We complement existing BI tools by providing an integrated visualization and enrichment approach. It typically helps business users build visualizations on top of data that has not yet been integrated by IT.

Q. Are what Sir David Cox calls “small, carefully planned investigations” orthogonal to self-service BI tools like Tableau or QlikView, which can operate in a hypothesis vacuum?

As of today, you have new challenges dealing with Big Data, where you want to perform analytics that have the sophistication of small, carefully planned investigations but keep the open-mindedness of a self-service approach. To do that, you need to make business and IT work together on the same tool and the same project.

Q. Where do you think we are with Big Data along the Gartner hype cycle? (Feel free to amend their construct as you see fit.)

I think that analyzing Big Data in terms of the hype cycle is a bit misleading, as it refers to technologies which are now very mature (such as the open source technologies used by the online advertising industry, or the large-scale databases used by the banking industry) but which are used today in contexts where the industry or use cases are less mature (such as the Internet of Things or automated customer relationship use cases).

Q. Those in the non-academic, media side of the data business extol the virtues of narratives and storytelling. They’re not always clear, though, about what data to use and when. How should customers proceed in curating data-driven, data-annotated stories?

Storytelling applies to data to the extent that you actually need to provide justification, direction and arguments when you want to build something out of data. Today, if you want to provide a dashboard that helps people take action based on data, more than green lines and bars are expected. You need to tell why, how and where you got the data you want to show. There’s a need to empower business people with a considered view of the data and the process behind it, not to hide that in the IT realm. You need a full data story, not just the back cover.

Q. What standards – formal or de facto – are you watching most closely?

It’s not a standard per se, but MLBase and Spark provide interesting new ground for building data projects.

Q. What trends are you watching most keenly?

The use of data in bio-related areas, such as drug discovery, personalized medicine or personalized crops. Today, we are only toying with data generated by computers. What will happen when we get to the real thing?

Q. Do you have a few favorite software tools that you feel are indispensable?

I think immediately of two tools: SublimeText and OmmWriter. They changed my experience of writing text on a computer.

Mark Underwood writes about knowledge engineering, Big Data security and privacy @knowlengr LinkedIn.
