At this year’s Strata Data Conference in New York City, Syncsort’s Paige Roberts sat down with John Myers (@johnlmyers44) of Enterprise Management Associates to discuss what he sees in the evolving Big Data landscape. In this final blog in the three-part interview, we’ll discuss the 80/20 rule of data science, which points out that most data scientists spend 80% of their time getting data ready for analysis, rather than doing what they do best.
In case you missed the earlier parts of our interview… In the first part of the discussion, Myers pointed out a shift away from technology and toward business value and some advantages of in-memory processing for machine learning. In part two, we talked about how to deal with cultural pushback against machine learning applications and how to get machines and people working together to take advantage of the strengths of each.
Roberts: One of the things with machine learning that you’ve heard a hundred times is the 80/20 rule, where 80% of a data scientist’s job is data prep: getting the data where they need it, in the format they need it. That’s where we help. So, I’m going to ask a totally self-centered question.
According to the 80/20 rule of data science, four days of each business week are spent on gathering data, while only one is spent on running algorithmic models. But what if data scientists already had the data they needed?
So you have an idea of what we’re up to at Syncsort. What’s exciting to you? What do you think is cool in this area?
Well, you’re right. Most of what a data scientist has to do is get the right data together so they can apply it to their model, or manipulate the data that they have. Now, if I don’t know where it is, I must go traipsing around looking for it. Being able to discover, being able to have it at my fingertips, being able to move it around and things of that nature …
Find it, join it.
Exactly. Back to the concept of what data scientists like to do: do they like to manipulate data? No.
They like to run models, and they like to compare them, and they like to do that.
Build algorithms and play with them.
Exactly! That’s the real value it provides. If we could have systems to take on some of that burden and say, “Hey, maybe we’ve got a dataset.” If you can discover what’s in the systems and then say, “Oh, now I need to bring it to someone.” And say, “That’s a great one…”
And push a couple of buttons and boom, it’s where you need it.
Exactly. Now, instead of like you said, 80%, that’s four days out of the week. Right?
That leaves me one day to manipulate.
To do what I actually like doing.
Right, and flip that over. I now have one day to pull data, and I have four days to play with it.
How good are our machine learning algorithms going to be now?
Right. If I have one day out of my week, I’m going to get one answer, per se. Right?
If I have four days, I may have four answers or I may have eight answers. And now I’m looking at the best answer, not just an answer. If we can flip that over, I don’t think we can ever get rid of it because…
You’ve got to have the data. You can’t do anything with data until you first have it. And have it in the form that you need it.
Exactly. I call it spindle, fold and mutilate, and it’s not necessarily that way. But to go through that process, it’s gonna take…
You must get the data. You have to push it together. You have to change its form. You have to take out the stuff you don’t want.
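As a rough illustration of the four steps just described (get the data, push it together, change its form, take out the stuff you don’t want), here is a minimal Python sketch. The customer and order records, the field names, and the `prepare` function are all hypothetical, made up for this example; nothing here reflects a Syncsort product or API.

```python
# Hypothetical in-memory "sources". In practice these might be a Kafka feed,
# a mainframe extract, or database tables.
customers = [
    {"customer_id": 1, "name": "Ada", "region": "east"},
    {"customer_id": 2, "name": "Grace", "region": "west"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": "19.99"},
    {"order_id": 11, "customer_id": 1, "amount": "5.01"},
    {"order_id": 12, "customer_id": 2, "amount": "n/a"},   # unparseable amount
    {"order_id": 13, "customer_id": 3, "amount": "7.50"},  # no matching customer
]


def prepare(customers, orders):
    """Join orders to customers, convert amounts to floats, drop unusable rows."""
    # Push it together: index customers by key so each order can be joined.
    by_id = {c["customer_id"]: c for c in customers}
    rows = []
    for order in orders:
        customer = by_id.get(order["customer_id"])
        if customer is None:
            continue  # take out orders with no matching customer
        try:
            amount = float(order["amount"])  # change its form: text -> number
        except ValueError:
            continue  # take out rows whose amount cannot be parsed
        rows.append({"region": customer["region"], "amount": amount})
    return rows


model_ready = prepare(customers, orders)
print(model_ready)
```

Even in this toy case, most of the lines deal with joining and cleaning rather than modeling, which is the imbalance the 80/20 rule describes.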
Exactly! And then when you have that, and if you can flip that over, go from 80-20 to 20-80, then you have more time in your day.
And all your smartest people aren’t spending all their time playing around getting data in the right form.
Right. And I’m sure they’re like me. The more time I spend with my fingers in the data, the more insights I find. If I’ve only got two hours out of eight, or I’ve only got one day out of four, I will find a limited number of insights. But the more time I have, …
The more you’ll find.
… the more insights I will pull together, the more things I’ll do. And I think that’s like our data scientists. They’re special, expensive people and we want to help them be the best possible people that they can be.
Well, being an ETL person rather than a data scientist, that’s where I live. But my job is essentially to create something for them that makes it easier. To make sure that when you go to do your machine learning algorithm, you’re using all the data because there isn’t some feed coming in from Kafka that you can’t get to.
Or there isn’t some data sitting on the mainframe that’s like, “Hey, look, I have 20 years of customer data sitting there, but I can’t get to it because it’s on the mainframe in some obscure format that nobody’s heard of in the last 20 years.”
Well, back to something that we talked about a little earlier, before the start of the recording: SELECT * is sometimes not the greatest answer.
I only want these four columns.
Right. But I don’t know which four columns until after I get all that data and look at it.
Exactly! A data scientist who has access to, like you said, ALL the data is going to pick the right four to five columns. If they’ve only got four to five columns…
Then they’re just taking what they got. And making do with it.
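One way to picture the exchange above: you can only choose the right four columns after looking at all of them. The sketch below profiles every column in a small made-up table and keeps the ones that actually vary. The table, the column names, and the "more than one distinct value" rule are illustrative assumptions, not a recommended feature-selection method.

```python
# Hypothetical sample of a wide table; a real one would have many more columns.
rows = [
    {"id": 1, "country": "US", "plan": "basic", "tenure_months": 3, "churned": 0},
    {"id": 2, "country": "US", "plan": "pro", "tenure_months": 26, "churned": 0},
    {"id": 3, "country": "US", "plan": "basic", "tenure_months": 1, "churned": 1},
]


def profile(rows):
    """Count distinct values per column across all rows."""
    distinct = {}
    for row in rows:
        for column, value in row.items():
            distinct.setdefault(column, set()).add(value)
    return {column: len(values) for column, values in distinct.items()}


def informative_columns(rows, exclude=("id",)):
    """Keep columns that vary, skipping identifiers listed in `exclude`."""
    counts = profile(rows)
    return [c for c, n in counts.items() if n > 1 and c not in exclude]


print(profile(rows))
print(informative_columns(rows))
```

Here `country` is dropped because it never changes in the sample, something that is only visible once the whole table has been pulled and inspected.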
Exactly! So, I think we’ve got some great opportunities. I think it’s continuing to grow, and I’m really looking forward to what we can learn at the show.
In our eBook, Mainframe Challenge: Unlocking the Value of Legacy Data, we review ways to help you tackle the obstacles of data integration to unlock the value of your mainframe data.