At the recent Cloudera Sessions event in Munich, Germany, Paige Roberts of Syncsort had a chat with Katharine Jarmul, founder of KJamistan data science consultancy, and author of Data Wrangling with Python from O’Reilly. She had just given an excellent presentation on the implications of GDPR for the European data science community.
In the first part of the interview, we talked about the importance of being able to explain your machine learning models – not just to comply with regulations like GDPR, but also to make the models more useful.
In this part, Katharine Jarmul again goes beyond the basic requirements of GDPR to discuss some of the important ethical drivers behind studying the data fed to machine learning models. A biased data set can have a huge impact in a world increasingly driven by machine learning.
In part 3, we’ll talk about being a woman in a highly technical field, the challenges of creating an inclusive company culture, and how bias doesn’t only exist in machine learning data sets.
In the final installment, we’ll discuss some of the work Ms. Jarmul is doing in the areas of anonymization so that data can be repurposed without violating privacy, and creating artificial data sets that have the kind of random noise that makes real data sets challenging.
Paige Roberts: Okay, another interesting thing you were talking about in your presentation was the ethics involved in this area. If you’ve got that black box, you don’t know where your data came from, or maybe you didn’t really study it enough. You didn’t sit down and, how did you put it? Introspect. You didn’t really think about where that data came from, and how it can affect people’s lives.
Katharine Jarmul: Yeah, there’s been a lot of research coming out about this, particularly when we have a sampling problem. For example, let’s say we have a bunch of customers, and only 5% are aliens. I will use that term, just as if they were Martians. We have these aliens that are using our product, and because we have the sampling bias, any statistical measurement we take of this 5% is not really going to make any sense, right? So, we need to recognize that our algorithm is probably not going to treat these folks fairly. Let’s think about how to combat that problem. There are a lot of great mathematical ways to do so. There are also ways that you can decide to choose a different sampling error, or choose to treat groups separately in your classification. There are a lot of ways to fight this, but first you must recognize that it’s a problem.
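One of the mathematical approaches alluded to above can be sketched in a few lines. The example below is only an illustration under stated assumptions, not a method named in the interview: it assigns each record a weight inversely proportional to the size of its group, so that a 5% minority carries the same total weight as the 95% majority. The function name `balanced_group_weights` and the "earthling"/"alien" records are hypothetical.

```python
from collections import Counter


def balanced_group_weights(groups):
    """Return one weight per record so every group carries equal total weight.

    Hypothetical helper: weight = n_records / (n_groups * size_of_group).
    """
    counts = Counter(groups)
    n_records = len(groups)
    n_groups = len(counts)
    return [n_records / (n_groups * counts[g]) for g in groups]


# Assumed toy customer base: 95 majority records and 5 minority records.
groups = ["earthling"] * 95 + ["alien"] * 5
weights = balanced_group_weights(groups)

# Sum the weights per group to check that both groups now count equally.
total_by_group = Counter()
for group, weight in zip(groups, weights):
    total_by_group[group] += weight

print(dict(total_by_group))
```

Weights like these could be passed to a training routine that accepts per-sample weights. Reweighting does not add information about the small group, so it complements, rather than replaces, the step of recognizing the problem in the first place.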
Roberts: If you don’t recognize that there MIGHT be a problem, you don’t even look for it, so you never realize it’s there.
Jarmul: Exactly, and I think that’s key to a lot of these things that are coming out that are really embarrassing for some companies. It’s not that they’re bad companies, and it’s not that they’re using horrible algorithms. It’s that if you don’t think about all the potential ramifications for all the potential groups, it’s really easy to get to a point where you have to say, “Oops, we didn’t know that,” and issue a big public apology.
Roberts: Something like, I have made decisions on who gets a loan, or who gets a scholarship, or who gets promoted, or who gets hired, and it’s all based on biased data. I didn’t stop and think, “Oh, my dataset might be biased.” And now, my machine learning algorithm is propagating it. There were a lot of talks about that at Strata. Hilary Mason did a good one on that.
Jarmul: Oh excellent. Her work at Fast Forward Labs on interpretability is some of the best in terms of pushing the limit for how we apply interpretability, and the accountability that comes with it, to “black box” models.
Roberts: Because if you don’t know how your model works, you can’t tell when it’s biased.
Jarmul: Exactly. And if you spend absolutely ALL of your time on “How can I get the precision higher?” and “How can I get the recall higher?”, and none of your time on “Oh wait, what might happen if I give the model data it might not have seen before, perhaps data from a person with a different color of skin, or a person with a different income level, or whatever it is? How might the model react?”, then those things will really sneak up on you [laughs].
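To make the point about precision and recall concrete, here is a minimal sketch, assuming binary labels (1 = positive, 0 = negative) and a group label for each record. It computes both metrics separately per group, so that a model that looks fine overall but fails a small group becomes visible. The function name `precision_recall_by_group` and the toy data are hypothetical, not from the interview.

```python
from collections import defaultdict


def precision_recall_by_group(y_true, y_pred, groups):
    """Compute precision and recall separately for each group label."""
    tallies = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        tally = tallies[group]  # ensures every group appears in the result
        if pred == 1 and truth == 1:
            tally["tp"] += 1
        elif pred == 1 and truth == 0:
            tally["fp"] += 1
        elif pred == 0 and truth == 1:
            tally["fn"] += 1

    results = {}
    for group, tally in tallies.items():
        predicted_pos = tally["tp"] + tally["fp"]
        actual_pos = tally["tp"] + tally["fn"]
        results[group] = {
            # None marks a metric that is undefined (no positives to divide by).
            "precision": tally["tp"] / predicted_pos if predicted_pos else None,
            "recall": tally["tp"] / actual_pos if actual_pos else None,
        }
    return results


# Assumed toy data: the model is perfect on group "a" and misses
# every positive case in the much smaller group "b".
groups = ["a"] * 8 + ["b"] * 2
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

results = precision_recall_by_group(y_true, y_pred, groups)
for group, metrics in results.items():
    print(group, metrics)
```

In this toy data the pooled precision is perfect, because the model never makes a false positive, while recall for group "b" is zero. A single aggregate number would not surface that gap.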
Be sure to check out part 3 of this interview, where we’ll discuss the challenges of women in tech, and the biases that exist, not just in our data sets, but also in our culture and our own minds.
If you want to learn more about GDPR compliance and how Syncsort can help, be sure to view the webcast recording of Michael Urbonas, Syncsort’s Director of Data Quality Product Marketing, on Data Quality-Driven GDPR: Compliance with Confidence.
Katharine Jarmul on If Ethics is Not None
Katharine Jarmul on PyData Amsterdam Keynote on Ethical Machine Learning
Keith Kohl, Syncsort’s VP of Product Management on Data Quality and GDPR Are Top of Mind at Collibra Data Citizens ’17