Expert Interview (Part 2): Paco Nathan on the Current State of Agile and Deep Learning
At the recent Strata Data conference in NYC, Paige Roberts of Syncsort had a moment to sit down and speak with Paco Nathan of Derwen, Inc. In part one of the interview, Roberts and Nathan discussed the origins, current state, and future trends of artificial intelligence and neural networks.
In this second part, Roberts and Nathan go into the current state of Agile and deep learning.
Roberts: Changing the subject a little, one of the other things you talked about that struck me pretty strongly is that, basically, the father of Agile says don’t do Agile anymore. [Laughter]
Nathan: [Laughter] Right!
Roberts: Can you talk about that a little bit?
Nathan: Yeah, I was referencing a recent essay, from just a few months ago, by Ron Jeffries, who co-created Extreme Programming. Pair Programming came out of that. Scrum came out of that. A lot of the things we recognize as Agile came from that. He was one of the signatories of the Agile Manifesto nearly 20 years ago. Recently he came out saying that the definitions of Agile he’s seen floating around in industry don’t have anything to do with the intention they were trying to strike at. He wrote down, “20 years later, here’s my advice for what you really need to do with your team. Let’s get away from the names, and let’s just really focus on how to make teams better.”
Roberts: Wow. Okay. What’s the paper that he did?
Nathan: It’s called “Developers Should Abandon Agile.”
That’s pretty interesting. I think there are tons of software companies right now for whom that’s the Bible. You have to do Agile to survive.
If you saw the talk by David Talby, that was a really good one too. It was called “Ways Your Machine Learning Model Can Crash and What You Can Do About It.” He’s done a lot of work, especially in healthcare, with machine learning, and he had case study after case study of what goes wrong. The point there was that the real work is not developing the machine learning model. The real work is, once you put it into production, what you have to do to make sure that it’s right, and that’s ongoing.
Yeah. That’s always true.
I heard David’s talk in London five minutes before my talk, and I made a slide to represent some of what he talked about because it fit in with what I was saying. I showed it, and then there were arguments out in the hallway afterwards, because the Agile people were like, “How dare you say that!” It’s really salient, because if I’m developing a mobile web app, and I’m the engineering director of the team, I’m going to bring in my best architects and team leads early in the process. They’re going to define the high-level architecture and the interfaces. As the project progresses into fleshing out different parts of the API and getting into more of a maintenance mode, I don’t have to have my more senior people involved.
With machine learning, it is the exact opposite. If I’ve got a dataset and I want to train a model, that’s a homework exercise for somebody who’s just beginning in data science. I can do that off the shelf. But once you’ve deployed and start seeing edge cases, and the issues that have to do with ethics and security, that’s not a homework exercise. Unless you’re in context, actually running in production, you’re not going to know in advance what those issues are.
Yeah, but a lot of the conversation now is about the fact that most of your datasets are in some way biased, and there’s a lot of ethics involved in launching a machine learning model. I just saw an article online saying that ethics in machine learning is becoming a first-year course for people training in ML and AI (Carnegie Mellon, University of Edinburgh, Stanford). I guess it actually speaks a little bit to what you said about putting your experts at the end, during production. To a certain extent, it seems to me like you also want to have the experts at the beginning, looking at the data before it even starts the process.
Definitely. Deloitte, McKinsey, Accenture, all of them, when we do executive briefings, they all want that foundation set at the beginning. Before we even talk about introducing machine learning into your company, you need to get your ducks in a row: breaking down the data silos, getting your workflow for cleaning your data in place, and building a culture that uses data engineering and data science appropriately. You need to do all of those things before you can even start on machine learning. There’s a lot of foundation that needs to be laid correctly.
I said something about the high percentage of machine learning projects that never make it into production on Twitter, and got a response from John Warlander, a Data Engineer at Blocket in Sweden. He said, “I sometimes wonder how many of those ‘not in production’ big data projects happen in companies that don’t even have their ‘small data’ in order. That’s often where most of the low-hanging fruit is.” I’ll put that in my blog post about the Strata event themes and industry trends. We’re talking about a lot of those important themes, so I’ll probably put a lot of quotes from you in it.
David Talby had a great quote: “Really, if you want to talk about AI in a product, what you’re talking about is what you’re going to do once you’re deployed and the product’s being used by customers. How do you keep improving? Because if you’re not doing that, you’re not doing AI.”
Well, if you’re not doing that, you’re certainly not getting that feedback loop. You’ve lost it. When you look at any model’s improvement in accuracy over random chance, there’s always that curve: accuracy climbs, and then it degrades over time if you don’t constantly retrain your models. One of the themes for Syncsort, as a data engineering kind of company, is making sure that the data you’re feeding in is itself constantly refreshed and improved. You said something in your talk that stuck with me: the value in ML and AI right now isn’t so much in iterating through models, or getting the best model, it’s in feeding your models the best datasets.
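The accuracy-decay curve described here can be sketched as a simple production monitoring check. This is a hypothetical illustration, not any specific vendor's tooling; the class and parameter names (`DriftMonitor`, `baseline_accuracy`, `tolerance`) are invented for the example:

```python
# Hypothetical sketch: flag a deployed model for retraining when its
# rolling accuracy drifts below a tolerance relative to the accuracy
# it had at deploy time. All names here are illustrative.
from collections import deque

class DriftMonitor:
    def __init__(self, baseline_accuracy, window=100, tolerance=0.05):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = wrong

    def record(self, prediction, actual):
        # Log one production prediction against its eventual ground truth.
        self.outcomes.append(1 if prediction == actual else 0)

    def rolling_accuracy(self):
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        # Wait for a full window of outcomes before judging drift.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.rolling_accuracy() < self.baseline - self.tolerance
```

In practice the hard part is the `record` call: getting timely ground-truth labels back from production is itself a data engineering problem, which is the point being made here.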
I mean, if you want a good data point on that: a lot of these companies, even the leaders in AI, will share their code with you. They’re not going to share their data. That was kind of the punchline of the situation with CrowdFlower, now Figure Eight. Google bought into self-driving cars, and they realized they could replace a lot of one-off machine learning processes with deep learning, but to do that, they needed really good labelled datasets. Other manufacturers saw their success and wanted to do self-driving cars, too. They hired the talent, and the first thing they found out is that if they want to do deep learning, they don’t have enough data, or enough good, labelled data. So they go to Figure Eight and ask, “Hey, can you label our datasets?”
Lukas Biewald, the founder of Figure Eight, was speaking in San Francisco a couple of years ago, saying, “Yeah, for about $2–3 million per sensor, we’d be happy to work with you on that.” And he had customers lined up, GM and all the others, because…
Because it’s worth it.
Yeah and if they don’t have it, they’re out of the self-driving car business. It may be a high price but it will likely include years of data.
People focus so much on the models. I have to have the most sophisticated algorithm, …
No. That’s not it.
The only reason that AI didn’t take off back in the ’80s or the ’90s, when you and I were first studying it, was that we didn’t have enough data. We couldn’t crunch it. We couldn’t ingest that amount of data and do anything with it, affordably.
There needed to be millions of cat pictures on the internet before we could really do deep learning.
Before we could create something that could identify a cat picture. That’s just the nature of the game.
That was the paper that launched it all. And then the open source frameworks for using GPUs to accelerate it.
That’s really taking off now in spaces other than video games. Walking the Strata floor, there are a lot more vendors out there taking advantage of GPUs.
There’s nothing really sacred about the architecture of a GPU with respect to machine learning. It just happens to be faster than a general-purpose CPU at doing linear algebra. But now we’re seeing more ASICs that can do more advanced linear algebra, at enough scale that you don’t have to go across the network. That’s the game. We’ll probably see a lot more custom hardware. Basically, we’re in this weird sort of temporal chaos regime where hardware is moving faster than software, and software is moving faster than process.
Hardware ALWAYS moves faster than software. Most software is only now, in the last few years, catching up to things like vectorization to take advantage of regular CPU caches.
And now we’re running TensorFlow computations on GPUs.
Exactly. And we’re creating compute hardware that’s specific to the task. Software always lags behind the hardware, and then business processes have to develop after that.
Yeah, you have to log some time doing the job before you can really figure out the process. I think your company is in a really good space right now. You’ve got to get the data right. And it’s not just a one-off. You’ve got to keep getting the data right across your company, now and forevermore.
Yeah, tracking and reproducing data changes in production is a big challenge for our customers. If you made 25 changes to the data to make it useful for model training, you then have to make those exact same 25 changes in production so that the model sees data in the format it’s expecting. I’m doing a series of short webinars on tackling the challenges of engineering production machine learning data pipelines, including one on tracking data lineage and reproducing data changes in production environments. So is there anything else going on at the moment that you’d like to let us know about?
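The “25 changes” problem described here can be sketched as an ordered pipeline of named transformation steps, so that training and production apply exactly the same sequence, and the lineage of changes stays auditable. This is a minimal hypothetical sketch; the step names, record format, and `Pipeline` class are invented for illustration, not Syncsort’s actual tooling:

```python
# Hypothetical sketch: capture data-prep steps as one ordered, named
# pipeline, so the exact same transformations run at training time and
# in production, and the lineage of changes can be reported.

def lowercase_names(record):
    # Example cleaning step: normalize the name field.
    record["name"] = record["name"].lower()
    return record

def fill_missing_age(record):
    # Example cleaning step: impute a sentinel for missing ages.
    record.setdefault("age", -1)
    return record

class Pipeline:
    def __init__(self, steps):
        self.steps = steps  # ordered list of (name, function) pairs

    def apply(self, record):
        # Apply every step, in order, to one record.
        for _, step in self.steps:
            record = step(record)
        return record

    def lineage(self):
        # The audit trail: which changes were applied, in what order.
        return [name for name, _ in self.steps]

prep = Pipeline([
    ("lowercase_names", lowercase_names),
    ("fill_missing_age", fill_missing_age),
])
```

Because the same `prep` object is used for both the training dataset and live production records, the model never sees data in a format other than the one it was trained on, and `prep.lineage()` documents the changes for reproducibility.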
I have a little company called Derwen.ai. If you check there, we’ve got a lot of articles. It’s my consulting firm, and we do a lot of work with the conferences. We get a real bird’s-eye view, and we hear from all kinds of people. We’re like Switzerland. We get to hear what a lot of people are working on, even if they’re not ready to go public with it. I hear the pain points people are dealing with, and I help out the start-ups. It’s kind of like a distributed product management role.
Cool. All right, well, thanks for talking to me. I really enjoyed your presentation.
Thank you very kindly, so good to see you.