5 Big Trends in Machine Learning, Artificial Intelligence and Data Engineering from Strata NYC 2018 – Part 3
Here are the five big trends I noticed, and some of the things that Paco Nathan and some of the other folks at Strata had to say about them. In part 1, we covered trends 1 and 2, which focused on machine learning and the constant stream of data. Part 2 covered trends 3 and 4, which focused on artificial intelligence and data engineering.
5. Big Data Quality and Data Governance
Data is the new source code. You have to get the data right, both big and small.
No matter how awesome your hardware is or how brilliant your software, making machine learning work means getting the data right: not just the big data, but all the data that drives a business.
MapR’s Ted Dunning, along with Ellen Friedman, just came out with a new book, AI and Analytics in Production. In his presentation, Ted Dunning cited Gartner as saying that only 15% of machine learning projects are in production.
I tweeted that comment, along with an image from his presentation:
In response, I got a reply from Johan Warlander, Data Engineer at Blocket in Sweden, who said, “I sometimes wonder how many of those ‘not in production’ big data projects happen in companies that don’t even have their ‘small data’ in order. That’s often where most of the low-hanging fruit is.”
“Before we even talk about introducing machine learning into your company, you need to get your ducks in a row as far as breaking down the data siloes, getting your workflows set up for cleaning your data, and establishing a culture that’s based around using data engineering and data science appropriately.” – Paco Nathan
To get the value out of machine learning projects, you have to get the data right first.
Data is the new source code. Nowadays, software isn’t where the greatest value comes from. For machine learning, good, clean, labelled datasets are where the real value lies.
“A lot of companies that are the leaders in AI, they’ll share their code with you, but they won’t share their data.” – Paco Nathan
Paco mentioned that he’d seen talks from the folks at Figure Eight (@FigureEightInc, formerly CrowdFlower), who specialize in human-in-the-loop machine learning. They get a fair amount of work, charging millions of dollars, just labelling datasets, such as the sensor data used for deep learning in self-driving cars. And they have customers lined up, because it’s worth it. If those big car companies can’t get that data, they’re out of the self-driving car business.
Often people focus on the math, thinking you have to have the most sophisticated algorithm to make the best model. But we’ve had the math for decades. The reason that AI is taking off now, rather than back in the ’80s or the ’90s, is that we couldn’t crunch the data back then. We didn’t have mountains of data, and even if we had, we couldn’t have afforded to store it or process it until the modern big data technologies came along.
“We needed to have millions of cat pictures on the internet before we could make deep learning work.” – Paco Nathan
Before we could create something that could identify a cat picture, we needed that data. That’s just the nature of the AI game.
Cloud, microservices, and a lot of other technologies are changing the way we manipulate data, making it attainable to more people, but at its heart it’s still the same old ETL that we always did on “small data.” Often, people now are doing ETL without even realizing it. They’re calling it data preparation or data munging or data engineering, and saying ETL is dead.
“ETL is no more dead than a caterpillar is dead when it becomes a butterfly. ETL has transformed.” – Gwen Shapira
When we were talking about how important and fundamental data has become, Paco Nathan smiled at me.
“Your company, Syncsort, is in a really good space because, you’ve got to get the data right. And it’s not just a one-off. You’ve got to keep getting the data right across your company now and forevermore.” – Paco Nathan
Data engineering is such an essential aspect of AI and ML that a lot of folks in this space take it for granted, the way you never think about the foundation of your house.
Maybe it’s even becoming a little bit boring.
Make sure to download our white paper on Why Data Quality Is Essential for AI and Machine Learning Success.