Expert Interview (Part 3): Dr. Sourav Dey on Data Quality and Entity Resolution
At the recent DataWorks Summit in San Jose, Paige Roberts, Senior Product Marketing Manager at Syncsort, had a moment to speak with Dr. Sourav Dey, Managing Director at Manifold.
In the first part of our three-part interview, Roberts spoke to Dey about his presentation, which focused on applying machine learning and data science to real-world problems. Dey gave two examples of matching business needs to what the available data could predict.
In part two, Dey discussed augmented intelligence, the power of machine learning and human experts working together to outperform either one alone.
In this final installment, Roberts and Dey speak about the importance of data quality and entity resolution in machine learning applications.
Roberts: In your talk, you gave an example where you tried two different machine learning algorithms on a data set, and didn’t get good results either time. Rather than trying yet another, more complicated algorithm, you concluded that the data wasn’t of good quality to make that prediction. What quality aspects of the data affect your ability to use it for what you’re trying to accomplish?
Dey: That’s a deep question. There are a lot of things.
Roberts: Let’s dive deeper then.
Dey: So, at the highest level, there’s the quantity of data. You can’t do very good machine learning with only a handful of examples. Ideally, you need thousands of examples. Machine learning is not magic. It’s about finding patterns in historical data. The more data, the more patterns it can find.
People are sometimes disappointed by the fact that if we’re looking for something rare, they may not have very many examples of it. In those situations, machine learning often doesn’t work as well as desired. This is often the case when trying to predict failures. If you have good dependable equipment, failures are often very rare – occurring only in a small fraction of the examples.
There are techniques, like sample rebalancing, that can address certain issues with rare events, but fundamentally, more examples will lead to better performance of the ML algorithm.
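As a rough illustration of the sample rebalancing mentioned above, here is a minimal sketch of one common variant, random oversampling of the rare class. The interview does not say which rebalancing technique was used; the function name `oversample_minority` and the sensor readings are hypothetical, made up only for this example.

```python
import random


def oversample_minority(rows, label_of, seed=0):
    """Randomly duplicate rows of the smaller classes until all classes are the same size."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw with replacement to top the smaller class up to the target size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Hypothetical equipment readings: only 2 failures among 100 examples.
readings = [{"vibration": i, "failed": i >= 98} for i in range(100)]
balanced = oversample_minority(readings, label_of=lambda r: r["failed"])
```

Rebalancing like this would normally be applied to the training split only, so that evaluation still reflects how rare failures really are. It does not create new information, which is consistent with the point that more real examples help most.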
Roberts: What are other issues to be aware of?
Dey: Another aspect, of course, is whether the data is labeled well. Tendu talked about this, too, in her talk on anti-money laundering. Lineage issues are a problem. Things like, oh, actually, the product was changed here, but I never noted it. That means that all of these features have changed. This comes up a lot, particularly with web and mobile-based products where the product is constantly changing. Often such changes mean that a model can’t be trained on data from before the change because it is no longer a good proxy for the future. Labeling is one of the biggest issues. I gave you the example from oil and gas, where they thought they had good labeling, but they didn’t.
Roberts: How about missing data?
Dey: Missing data is surprisingly not that big of an issue. In the oil and gas sensor data, readings could drop off for a while because of poor internet connectivity. For small dropouts, we could fill the gap using simple interpolation techniques. For larger dropouts, we would just throw out the data. That’s much easier to deal with than labeling issues.
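The dropout handling described above can be sketched as follows, assuming linear interpolation and an arbitrary gap threshold. Neither the interpolation method nor the threshold is specified in the interview; `fill_short_gaps` and `max_gap` are hypothetical names chosen for this example.

```python
def fill_short_gaps(values, max_gap=3):
    """Linearly interpolate runs of None up to max_gap long; leave longer runs as None."""
    filled = list(values)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is not None:
            i += 1
            continue
        start = i
        while i < n and filled[i] is None:
            i += 1
        end = i  # index of the first reading after the gap, or n if the gap runs to the end
        gap = end - start
        if start == 0 or end == n or gap > max_gap:
            continue  # no reading on one side, or the dropout is too long to trust
        left, right = filled[start - 1], filled[end]
        step = (right - left) / (gap + 1)
        for k in range(gap):
            filled[start + k] = left + step * (k + 1)
    return filled


# Made-up sensor trace: one short dropout and one long dropout.
trace = [1.0, None, 3.0, None, None, None, None, 8.0]
repaired = fill_short_gaps(trace, max_gap=2)
# Readings that are still None belong to long dropouts and can be discarded.
usable = [v for v in repaired if v is not None]
```

The short gap is filled from its neighbors, while the long gap stays missing so it can be thrown out rather than guessed.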
Roberts: Can you talk a bit about entity resolution and joining data sources?
Dey: Yes, this is another problem we often face. The issue is about joining data sources, particularly with bigger clients. They’ll have three silos, seven silos, ten silos. Really big companies sometimes even have 50 or 100 silos of data that have never been joined, even though they cover the same user base.
Roberts: The data are all about the same people.
Dey: Right, and even within a single data source, it needs to be de-duplicated. The same records appear more than once. I’ll give a concrete example. We worked with this company that is an expert search firm. Their business is to help companies find specific people with certain skills, e.g. a semiconductor expert that understands 10 nanometer micron technology. Given a request, they want to find a relevant expert as fast as possible.
Clean, thick data drives business value for them by giving their search a large surface area to hit against. They can then service more requests, faster. Their problem was that they had several different data silos and they never joined them. They only searched against one. They knew that they were missing out on a lot of potential matches and leaving money on the table. They hired Manifold to help them solve this problem.
How do we join these seven silos, and then figure out if the seven different versions of this person are actually the same person? Or two different people, or five different people.
This problem is called entity resolution. What’s interesting is that you can use machine learning to do entity resolution. We’ve done it a couple of times now. There are some pretty interesting natural language processing techniques you can use, but all of them require a human in the loop to bootstrap the system. The human labels pairs, e.g. these records are the same, these records are not the same. These labels are fed back to the algorithm, and then it generates more examples. This general process is called active learning. It keeps feeding back the ones it’s not sure about to get labeled. With a few thousand labeled examples, it can start doing pretty well for both the de-duplication and the joining.
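One step of the active learning loop described above, picking the pairs the model is least sure about and sending them to a human, might be sketched like this. It assumes the current model outputs a match probability per record pair and uses the common "closest to 0.5" uncertainty rule; the interview does not say which selection rule was used. The records and probabilities are invented for illustration.

```python
def most_uncertain(scored_pairs, k=2):
    """Return the k record pairs whose match probability is closest to 0.5."""
    return sorted(scored_pairs, key=lambda item: abs(item[1] - 0.5))[:k]


# Hypothetical output of the current matching model: ((record_a, record_b), P(same person)).
scored = [
    (("Jon Smith, Acme", "John Smith, Acme Corp."), 0.93),
    (("J. Smith, Globex", "John Smith, Acme Corp."), 0.55),
    (("Ana Ruiz, Initech", "Bob Lee, Globex"), 0.02),
    (("A. Ruiz, Initech", "Ana Ruiz, Initech Inc."), 0.41),
]

# These pairs go to the human labeler; confident pairs are left alone.
to_label = most_uncertain(scored, k=2)
```

After the human marks each selected pair as "same" or "not the same", the labels are added to the training set, the model is retrained, and the selection is repeated.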
Roberts: The compute becomes pretty challenging when you have large data sets. Tendu mentioned it in her talk on anti-money laundering: you have to compare everything to everything, and do it with these fuzzy matching algorithms. That’s a challenge.
Dey: That’s a challenge, yeah. One of the tricks is to use a blocking algorithm, which is a crude classifier. Then, after the blocking, you have a much smaller set to do the machine learning-based comparison on. That being said, even the blocking has to be run on N times M records, where N and M are millions of records.
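To make the idea of blocking concrete, here is a minimal sketch of one simple form, key-based blocking, where only records that share a cheap key are passed on for detailed comparison. The interview does not say which blocking method was used; the key (first three letters of the surname), the field names, and the records are assumptions for this example.

```python
from collections import defaultdict
from itertools import product


def block_key(record):
    """Crude, cheap grouping key: the first three letters of the surname, lowercased."""
    return record["surname"][:3].lower()


def candidate_pairs(silo_a, silo_b, key=block_key):
    """Yield only the cross-silo pairs whose records fall in the same block."""
    blocks = defaultdict(lambda: ([], []))
    for record in silo_a:
        blocks[key(record)][0].append(record)
    for record in silo_b:
        blocks[key(record)][1].append(record)
    for left, right in blocks.values():
        yield from product(left, right)


# Made-up records from two silos.
silo_a = [
    {"id": "a1", "surname": "Smith"},
    {"id": "a2", "surname": "Ruiz"},
    {"id": "a3", "surname": "Lee"},
]
silo_b = [
    {"id": "b1", "surname": "Smithe"},
    {"id": "b2", "surname": "Ruis"},
    {"id": "b3", "surname": "Chen"},
]

pairs = list(candidate_pairs(silo_a, silo_b))
# 2 candidate pairs instead of the 3 x 3 = 9 an all-against-all comparison would need.
```

The trade-off is that a key this crude can miss true matches whose keys differ (a typo in the first letters, for instance), so the key has to be chosen loosely enough that the later, more expensive comparison still sees the real duplicates.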
Roberts: So if you have seven silos with a million records each and a hundred attributes per record, it’s a million times a million, seven times …
Dey: It blows up quickly. That’s where you have to be smart about parallelizing, and I think that’s where the Syncsort type of solution can be really powerful. It is an embarrassingly parallel problem. You just have to write the software appropriately so that it can be done well.
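A back-of-the-envelope calculation shows how quickly this grows. The sketch below assumes the seven silos of a million records each from the example above and counts every unordered pair once; it then shows one simple way to split independent work across workers, which is what makes the problem embarrassingly parallel. The helper names and the worker count are hypothetical.

```python
def all_pairs(n_records):
    """Number of unordered record pairs when everything is compared to everything."""
    return n_records * (n_records - 1) // 2


def split_for_workers(work_items, n_workers):
    """Round-robin split; each chunk can be processed without talking to the others."""
    return [work_items[i::n_workers] for i in range(n_workers)]


silos, records_per_silo = 7, 1_000_000
total_records = silos * records_per_silo
comparisons = all_pairs(total_records)  # 24,499,996,500,000 pairs

# Hypothetical block ids handed out to 3 workers.
chunks = split_for_workers(list(range(7)), n_workers=3)
```

Roughly 24.5 trillion pair comparisons is far beyond what a single machine can do in reasonable time, which is why blocking and parallel execution are used together.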
Roberts: Yeah, our Trillium data quality software is really good at parallel entity resolution at scale.
Dey: I like to work on clean data, and you guys are good at getting the data to the right state. That’s a very natural fit.
Roberts: It is! You need clean data to work, and we make data clean. Well, thank you for the interview, this has been fun!
Check out our white paper on Why Data Quality Is Essential for AI and Machine Learning Success.