Expert Interview (Part 1): Dr. Sourav Dey on Applying Machine Learning to the Real World

At the recent DataWorks Summit in San Jose, Paige Roberts, Senior Product Marketing Manager at Syncsort, had a moment to speak with Dr. Sourav Dey, Managing Director at Manifold. In the first part of this three-part interview, Roberts spoke to Dr. Dey about his presentation, which focused on applying machine learning to real-world requirements. Dr. Dey gives two examples of matching business needs to what the available data can predict.

Roberts: So, let’s get started! Can you introduce yourself and where you work, and what you do?

Dey: My name is Sourav Dey. I’m a Managing Director at Manifold. We’re an AI engineering services firm that focuses on accelerating AI projects for high-growth and Fortune 500 companies. I did my Ph.D. in Computer Science at MIT, and I’ve been doing algorithms engineering for the last 10 years, both real-time algorithms on embedded processors and big data algorithms in the cloud. Manifold does both, but most of our work has been on the latter. We build custom AI solutions for our clients.

You were a data scientist before data scientists were cool!

Yes, yes, before it was the thing. (both laughing) But I like the term. Yeah, I’m a data scientist. Though that seems to be losing vogue, and “machine learning engineer” is the new hotness. I’m fine with that, too.

You’re still doing the same thing, whatever folks want to call it this week. So, I know you just did your talk. You gave some interesting case studies, some great machine learning implementation examples. Would you like to talk about a couple of the examples from that?

Yeah. I gave two examples in my talk. One was about work we did for one of the leading baby registries in the US. We helped them go from unstructured desires to a clear engineering spec that could be built.

You condensed business need down to something you could build?

Exactly. I think that one of the key takeaways from the talk was that you have to learn about the business needs by getting in the customer’s shoes, as well as understand their data. It’s the marriage of the two where you can come up with the spec that an engineer can build to. Doing that is much of the challenge. For example, by doing a deep dive into their business we learned that this baby registry company wanted to make decisions faster. Because of the nature of their business, it would take nine months for any marketing or product experiment to have the final measured output that they could then make a decision on. That is far too long in the age of the Internet.

So we made them a set of predictive models that predict what is likely to happen nine months later, using what has been observed after a day, after 2 days, after 7 days, after 30 days, etc. We’ve deployed these machine learning models to production, and now they’re able to make decisions much more quickly, because that model is very accurate. They are able to make decisions on marketing campaigns and product changes much more rapidly by doing A/B tests and looking at the model output. Before, they were using heuristics to make their decisions. Now they can make more data-driven decisions, more rapidly.
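As a rough illustration of reading an A/B test through model output rather than waiting for the final measured result, the sketch below compares the mean predicted long-term outcome of two experiment arms. The function name `predicted_lift` and the input lists are hypothetical; the interview does not describe how Manifold's models or comparisons were actually implemented.

```python
from statistics import mean
from typing import Sequence


def predicted_lift(pred_a: Sequence[float], pred_b: Sequence[float]) -> float:
    """Relative difference in mean predicted long-term outcome, arm B vs. arm A.

    Each list holds one model prediction per user in that arm (hypothetical data).
    """
    base = mean(pred_a)
    return (mean(pred_b) - base) / base


# Hypothetical early-horizon predictions for users in each arm.
lift = predicted_lift([100.0, 200.0], [150.0, 180.0])
```

A comparison like this is only as good as the model behind it, so a real deployment would presumably also check the early predictions against outcomes once they are finally measured.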

Okay. So capturing the business need was a big piece of that. No matter how good your machine learning is, or your AI predictions are, at the end of the day, if they don’t meet the business need, and if the business need isn’t big enough to give a good ROI, it’s not worth bothering with the project.


So that was the first big point I got from your presentation. The second one was matching the business need up with the data. It might be extremely important and very valuable to the business to be able to make a prediction, but if they don’t have the data to support that prediction, you’ve got a problem. Can you talk about that one a little?

Yeah, the baby registry example was a good, positive outcome. The second example was a much more challenging data problem. Our client was an oil and gas company that wanted to make their maintenance operations more efficient. The dream was, “If I can predict when these machines are going to fail, I can turn unplanned maintenance into planned maintenance.” They had two major data sets. One was a sensor data set, one-minute tick samples of many different sensors coming off of the machines into their cloud. The other major data set was human-entered maintenance logs in their workflow software. It documented what parts they replaced, how long they worked, and had a lot of freeform notes.

Early on, during our data audit phase, we found that a lot of that human-generated data was very untrustworthy. It just wasn’t captured in a way that we could get good value out of it. There were five to seven different types of failures that they were particularly interested in. It turns out these failures were not labelled well in the maintenance logs because the root cause was not documented. Also, the way it was captured changed over the five years of history that they had. There was no good way to label the historical failures and see that a specific thing failed at a specific time for a specific reason.

They didn’t record the root cause. They just recorded that they replaced a specific part at a specific time.

Correct, and you could replace those parts for various reasons. An expert, maybe, could go back and figure out what happened there, but retroactively labelling the data would be costly and slow.

We ended up having them capture the root cause analysis going forward, improving the data with clean labels. But it wasn’t in the historical data they had already captured. That’s why we were unable to deliver the dream as they originally envisioned it.

But what we were able to do is predict a different class of failures using purely the sensor data. That data is much more trustworthy, without the data lineage issues of the maintenance logs. We ended up focusing on major faults where the machine went offline for over two hours. We were able to create a successful predictive model using the historical sensor data to predict these faults. This was useful to their maintenance operations, but at the same time, this definition of failure also catches many faults that are not as interesting to the business. It’s always a trade-off. This was the best we could do with the data that we had.
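To make the "offline for over two hours" definition concrete, here is a minimal sketch of how such faults could be labelled from one-minute sensor ticks. It assumes the sensor data has already been reduced to one boolean per minute indicating whether the machine was running; that reduction, the function name `major_faults`, and the parameter `min_minutes` are illustrative assumptions, not details given in the interview.

```python
from typing import List, Tuple


def major_faults(running: List[bool], min_minutes: int = 120) -> List[Tuple[int, int]]:
    """Return (start, end) tick indices of offline runs longer than min_minutes.

    `running` holds one value per one-minute tick; `end` is exclusive.
    """
    faults: List[Tuple[int, int]] = []
    start = None  # index where the current offline run began, if any
    for i, up in enumerate(running):
        if not up and start is None:
            start = i  # machine just went offline
        elif up and start is not None:
            if i - start > min_minutes:
                faults.append((start, i))
            start = None  # machine is back online
    # Handle an offline run that extends to the end of the data.
    if start is not None and len(running) - start > min_minutes:
        faults.append((start, len(running)))
    return faults


# Hypothetical history: a 30-minute outage, then a 150-minute outage.
ticks = [True] * 10 + [False] * 30 + [True] * 10 + [False] * 150 + [True] * 5
labels = major_faults(ticks)
```

Only the 150-minute outage is labelled here. The trade-off mentioned above shows up in the threshold: a duration-only rule says nothing about why the machine was offline, so long outages with uninteresting causes are labelled as faults too.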

Come back for part two where Roberts and Dey discuss augmented intelligence, AI as Triage, and the importance of model explainability.
