At the recent Cloudera Sessions event in Munich, Germany, Paige Roberts of Syncsort had a chat with Katharine Jarmul, founder of KJamistan data science consultancy, and author of Data Wrangling with Python from O’Reilly. She had just given an excellent presentation on the implications of GDPR for the European data science community.
In this first part of a 4-part interview, Katharine Jarmul discusses the importance of being able to explain your machine learning models – not just to comply with regulations like GDPR, but also to make the models more useful.
Paige Roberts: Let’s get started by having you introduce yourself.
Katharine Jarmul: I’m Katharine Jarmul, and I’m a data scientist. I founded my own company called KJamistan. We are both an American company and a German company. I work with clients on solving data problems. If they need somebody who has a smaller, more agile approach, if they need a proof of concept, or particularly if they’re trying to figure out how to move forward with ethical, or interpretable, or testable models, those are the kinds of areas that I have started to specialize in, and that I care a lot about.
So: “Have smart data scientist brain. Will travel.”
Yeah, [laughing] to some degree, yeah.
I just saw your presentation, and a couple of things jumped out at me. One was that you did a really good job of explaining how GDPR affects the machine learning and AI community in Europe. Can you give a little bit of a summary, or a general idea of what you talked about?
GDPR changes a few of the ways that we have to inform users about automated processing of their data. And to do so, we want to take an inventory of how we’re doing that currently, and put some thought into it. For some people, this is already very well-documented. It’s put together well, things are tested and interpretable. They already have a lot of transparency. I know for other people, this is really scary.
It’s a black box.
Yeah. And they haven’t thought a lot about how to make it repeatable, how to make it accountable. This is something that I think we should all be thinking about anyway, regardless of GDPR. What GDPR gives us is a motivation to get started on that. I would say we should go beyond it to create something really accountable, and reliable. Then, we can explain it to everyone within our companies. Then everybody knows: How does the decision process work? What type of data are we using?
Everyone gets it.
It’s silly to me when I talk with people, and realize that there are fiefdoms of data, and fiefdoms of automated processing. Nobody knows how the other departments are using things. I think that that’s really dangerous in a lot of ways.
It’s bad when companies don’t even know how their own decisions work.
Yeah, exactly [laughs]. How am I supposed to explain it to a customer as, let’s say, a sales representative, or as somebody on the customer service team? If it’s not clear and easy to explain, then how can I even sell the product, or how can I explain the problem?
How can you justify decisions? Well, I made this decision. Why? A machine learning algorithm that I don’t understand working on some data that I don’t know anything about told me it was a good decision. That is not a good justification.
Exactly. And I think that some of the folks that are moving us forward in terms of that are people who work in the medical and financial industries. They have legal ramifications and/or life or death ramifications of trusting something they can’t explain. I think that they’re really making some good inroads into the question of how we can have really accurate, advanced models that we can still explain.
We must understand how a conclusion or prediction was reached. Don’t just give the answer and not explain how we got there.
Really great things are happening in interpretability research. Quite a lot of the research is asking for area experts to give the ground truth. So, they’re trying to say, “Okay, what things can we basically take away from the model, and say that this should always be this way.”
What do we already know is always, basically true?
Exactly. And if we have those things that we already know are ground truths, and we then build a model on top of those, that model is usually way more performant. So, we need to talk with the area experts.
It’s going to be a lot better if you take advantage of that expertise. It’s a theme of this industry. People and machine learning working together are always going to be better than the people or the machine by themselves.
Exactly. When you read the history of AI, this is the original thing that all the folks who were working in cybernetics and so forth really wanted to ask: How can we use the machine to make better decisions, rather than just listening to whatever decision a machine made? [laughs]
Sure, let’s just let the machine order us around blindly, yeah. That’s a good idea.
[laughing] Exactly, I mean [chuckles] we have a brain for a reason.
Be sure to check out part 2 of this interview, where Katharine Jarmul will go beyond the basic requirements of GDPR, to discuss some of the ethical drivers behind studying the data fed to machine learning models.
If you want to learn more about GDPR compliance and how Syncsort can help, be sure to view the webcast recording of Michael Urbonas, Director of Data Quality Product Marketing, on Data Quality-Driven GDPR: Compliance with Confidence.