Expert Interview (Part 4): Katharine Jarmul on Anonymization and Introducing Randomness to Test Data Sets
At the recent Cloudera Sessions event in Munich, Germany, Paige Roberts, Syncsort’s Big Data Product Marketing Manager, spoke with Katharine Jarmul, founder of KJamistan data science consultancy and author of Data Wrangling with Python from O’Reilly. Jarmul had just given an excellent presentation on the implications of GDPR for the European data science community. In this final installment, we’ll discuss some of the work Katharine is doing on anonymization, so that data can be repurposed without violating privacy, and on creating artificial data sets with the kind of random noise that makes real data sets so problematic.
In the first part of the interview, we talked about the importance of being able to explain your machine learning models – not just to comply with regulations like GDPR, but also to make the models more useful.
In part 2, Katharine Jarmul went beyond the basic requirements of GDPR again, to discuss some of the important ethical drivers behind studying the data fed to machine learning models. Biased data sets can make a huge impact in a world increasingly driven by machine learning.
In part 3, we talked about being a woman in a highly technical field, the challenges of creating an inclusive company culture, and how bias doesn’t only exist in machine learning data sets.
Roberts: We’ve hit on several subjects here. What else are you working on?
Katharine Jarmul: I’ve been doing a lot more research on things like fuzzing data, test data, and how that relates to anonymization. I’ll be doing a series on that, but there are also some other cool libraries and things I can point to about that. As data scientists, we spend so much time cleaning our data, but how do we mess up our data? Not only to test our own workflow, and determine if it’s working properly, but also to perhaps do things like release it to third parties, or to other people, and have it be anonymized.
Yeah, there’s the example of the Netflix Prize. The guy de-anonymized that data, and Netflix was like, “Oh, oops.”
That was supposed to be anonymous data. We thought it was anonymous data.
Yeah, I’m also on a big kick to find out how we can create synthetic data that really looks like our data that has…
You can test with it.
I worked doing healthcare data integration for a long time. We were doing EDI to COBOL, which is a big jump in translation. All the pipelines we built were tested with fake data sets. I talked to the guys in charge of the team and told them that the minute we put real data through this system, it’s going to crash and burn. I don’t care how many of these EDI transactions you build with Marcus Welby, MD, and Barney and Betty Rubble, it’s not going to break the system like real data. Real data is always messier than we expect.
Yeah, and I think that if we find ways to test with some of that noise, and maybe even choose exactly what types of noise or randomness we want to pursue, then we can make sure that our validation is working properly. And if we don’t have validation, we should probably set that up [laughs].
We probably need some of that, yeah. Maybe. Just a thought. [laughing]
What could go wrong, right? [laughing]
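Editor’s note: the idea of choosing which kinds of noise to add to clean test data, then checking that validation catches it, can be sketched in a few lines of Python. This is only an illustration, not a tool or library mentioned in the interview. The record fields, the noise functions, and the validation rules are all hypothetical and chosen for the example.

```python
import random

# Hypothetical clean test records, like hand-built fake data sets.
CLEAN_RECORDS = [
    {"name": "Betty Rubble", "age": "34", "zip": "80301"},
    {"name": "Barney Rubble", "age": "36", "zip": "80301"},
    {"name": "Marcus Welby", "age": "58", "zip": "90210"},
]


def blank(value, rng):
    """Simulate a missing value."""
    return ""


def pad_whitespace(value, rng):
    """Simulate stray leading and trailing spaces."""
    return " " * rng.randint(1, 3) + value + " " * rng.randint(1, 3)


def swap_adjacent(value, rng):
    """Simulate a typo by swapping two neighboring characters."""
    if len(value) < 2:
        return value
    i = rng.randrange(len(value) - 1)
    chars = list(value)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


# The menu of noise types a tester can choose from (illustrative only).
NOISE_TYPES = {
    "blank": blank,
    "pad_whitespace": pad_whitespace,
    "swap_adjacent": swap_adjacent,
}


def add_noise(records, noise_names, rate, seed=0):
    """Return noisy copies of the records; the originals are not modified.

    Each field is corrupted with probability `rate`, using a noise type
    picked at random from `noise_names`. A fixed seed keeps runs repeatable.
    """
    rng = random.Random(seed)
    noisy = []
    for record in records:
        corrupted = dict(record)
        for field, value in record.items():
            if rng.random() < rate:
                noise = NOISE_TYPES[rng.choice(noise_names)]
                corrupted[field] = noise(value, rng)
        noisy.append(corrupted)
    return noisy


def is_valid(record):
    """Hypothetical validation rules for the example records."""
    name, age, zip_code = record["name"], record["age"], record["zip"]
    return (
        name != ""
        and name == name.strip()
        and age.isdigit()
        and 0 < int(age) < 130
        and len(zip_code) == 5
        and zip_code.isdigit()
    )


if __name__ == "__main__":
    noisy = add_noise(CLEAN_RECORDS, ["blank", "pad_whitespace"], rate=0.5, seed=42)
    for record in noisy:
        print(record, "valid" if is_valid(record) else "REJECTED")
```

Some noise slips through on purpose: swapping the digits of an age still yields a plausible age, so a format check alone will not flag it. Finding gaps like that is the point of testing with deliberately messy data.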
Well, thank you for talking to me.
Thanks so much, Paige. That was fun.
Yeah. I always enjoy these interviews. I always learn something new.