Episode 66: How to Think Like a Data Scientist (Even While AI Does All the Work)
[00:00:00] Dr Genevieve Hayes: Hello, and welcome to Value Driven Data Science, the podcast that helps data scientists transform their technical expertise into tangible business value, career autonomy, and financial reward. I'm Dr. Genevieve Hayes, and today I'm joined by Dr. Brian Godsey. Brian is a data science lead at AI platform as a service company DataStax.
[00:00:25] He is also the author of Think Like a Data Scientist and holds a PhD in mathematical statistics and probability. In this episode, you'll discover how the data science process has evolved in recent years and how to adapt your approach for maximum impact in the AI era. So get ready to boost your impact, earn what you're worth, and rewrite your career algorithm.
[00:00:52] Brian, welcome to the show.
[00:00:54] Dr Brian Godsey: Thank you for having me.
[00:00:55] Dr Genevieve Hayes: To begin with, I want to share something that resonated with me from your book, Think Like a Data Scientist. For listeners who aren't familiar with this book, rather than focusing primarily on tools and algorithms, Think Like a Data Scientist specifically addresses the data science process itself.
[00:01:15] And in the preface of this book, by way of explaining this approach, you wrote, "While surveying new data science literature, it became clear to me that most authors would rather explain how to use all the latest tools and technologies than discuss the nuanced problem solving nature of the data science process.
[00:01:37] Armed with several books and the latest knowledge of algorithms and data stores, many aspiring data scientists were still asking the question, where do I start?" Now, although your book was published back in 2017, I think this perfectly captures what I still see happening in the data science world today.
[00:01:59] Data science programs train data scientists to be experts at programming, statistics, machine learning, and the like. But that's like training builders to use hammers and saws without showing them how to construct a house. The data science process, how to apply these tools to create actual business value, is often either completely overlooked or just glossed over with a five-second mention of CRISP-DM.
[00:02:31] With the data science profession continuing to evolve, especially now with the rapid emergence of AI tools, your focus on process over techniques feels more relevant than ever. Could you briefly outline what you consider to be the core elements of the data science process that you described in your book?
[00:02:55] Dr Brian Godsey: I feel like with data science, you know, the science is in the name. And so bringing some scientific process to the process of data science seemed like something that was not being discussed as much as it could have been.
[00:03:06] And so things like asking the right questions at the beginning, middle, and end, really paying attention to the facts, gathering evidence, and not proceeding on assumptions, things like that. It's just a scientific process, but then you apply it to data.
[00:03:22] You apply it to software, and you use various software tools to accomplish the scientific process that you're trying to follow. It's the scientific process as described in however many books,
[00:03:32] brought into the data and software world. And so it's just getting through from the beginning to the end. You're always looking at results, but you always have a starting point that is incomplete. And so you need questions. You need evidence. You need progress towards the goal. And so really, that's all of that encapsulated, hopefully with enough detail to actually put it into action in software.
[00:03:52] Dr Genevieve Hayes: It's sort of weird. Data science in its very name is a science. And yet if you compare it to how the hard sciences are taught, I'm just thinking about, you know, I did high school chemistry and physics. Right from day one in those subjects, the whole scientific process is drummed into you.
[00:04:11] It's weird that that isn't the case with data science. With data science, it's just taught as here's a bunch of techniques now go figure out what to do with them.
[00:04:23] Dr Brian Godsey: I agree. I think back to when I started with data science, working at some software companies and tech companies years ago. Data science had I don't know how many memes on the internet and various graphics, and they said data science has three parts. And then one part was software.
[00:04:38] One part was statistics. The third part, depending on who you talked to, always swapped out. Some people said it was domain expertise. Some people said it was, I don't know, machine learning and certain very specific technology parts that were not included in software or statistics or something like that.
[00:04:53] And so we all know that software and statistics are involved. The third part was always in question. And so that's the part where either you have expertise, or you have a way to solve problems. And so, for me, I don't know if problem solving is that third part, or if it's just the overarching concept, because really I always think of it as: there is a problem to solve, and, you know, we have software and we have statistics and we have other tools at our disposal that we can then use to solve the problem.
[00:05:20] So, if the problem happens to be something that I can solve, great. If it happens to be something that a classical machine learning algorithm can solve, great. If it happens to be anything else, it's just finding out what do you have, what do you need, and then proceeding to put the data into the model, finding the results, interpreting them, all of these things that really are analogous to the scientific process.
[00:05:41] It's just that we're doing them all on a computer now, instead of being in the lab or being in the wild collecting samples.
[00:05:47] Dr Genevieve Hayes: One thing I particularly like about the process that you outline in your book, though, is that it feels a lot more grounded in the real world than some of the other data science processes that I've come across. So probably the most famous data science process is CRISP-DM. I think everyone has the five minute go through of CRISP-DM when they learn data science, and then everyone forgets about it afterwards. And I think that's because CRISP-DM doesn't work in the real world. You've got the six stages, which I think are business understanding, data understanding, I forget what it is. It's basically data exploration, data modeling, and then evaluation, and whatever. Conceptually that's fine, but firstly, they never tell you exactly what's involved in each of those steps, and secondly, life never proceeds as neatly as that CRISP-DM cycle implies that it should proceed.
[00:06:55] And one of the things I really like about your book is that you actually acknowledge that data science is uncertain and the data science process is uncertain. And so in your process, you have many of those steps similar to CRISP-DM, but you've also said, okay, we understand what the problem is, but then we understand the fact that our knowledge of it is uncertain.
[00:07:22] And so we have to make a plan that allows for that uncertainty. And you've got these examples of these sort of contingency flowchart type plans. And I think this is the only time I've ever seen someone say, "Yeah, you are going to encounter something you didn't expect at your planning phase, but you need to plan for that uncertainty."
[00:07:47] Dr Brian Godsey: Mm hmm. Yeah, I'm pumped that you like that part because when I was writing the book I started thinking about how in terms of uncertainty in data science, everyone thinks about the machine learning or probabilistic models and statistics and probability models and all of these like sort of classical quantifiable uncertainties.
[00:08:02] And that is important to know, like, whether you have a prediction that's 51 percent accurate or 99 percent accurate, for example, people are very familiar with that. That's well studied. It's important to know what you're doing there, but really, the flip side of uncertainty is the project uncertainty that you're talking about.
[00:08:19] As I was writing that part of the book, I was thinking about starting to make these flowcharts, basically saying, okay, do some data analysis here. And if the machine learning model does a really good job with all of the predictions, then we're done. So that's that. But if it doesn't do a great job with the predictions yet, then you need to think, what will help that?
[00:08:37] Do we need a better model? Do I need more data? Do I need to retrain it somehow, or whatever it is? And so if you get to this point where you know you're not done yet, but you're asking yourself, what do I need now? Then you maybe go back a step and sort of try it again in some sense. But this flowchart is saying, stop, wait, ask yourself, am I done or not?
[00:08:57] And then saying, maybe I should try something else. And actually writing this down, I started asking myself, am I really making a flowchart to tell people that they should stop and ask themselves, are these results good enough? Or do I need to go back and change something?
[00:09:12] In some ways, when you say that to people, it's so obvious. And some people will roll their eyes at you because it's a very obvious thing when you say it to them. It also becomes obvious, I think, as you implied, that not everyone does that. So it's a little bit of a cognitive dissonance: everybody understands that you need to check in and see if you achieved your goals or not,
[00:09:31] and if you haven't, then you need to go back and change something. But in the process of doing a project, not everyone really stops and looks to see what they have so far and what they need to do next. And so these are flowcharts of just a project. I mean, they could be in an engineering project.
[00:09:45] They could be in any sort of science or any sort of project, really. There's nothing specific to data science in there, except for the case studies I was using. They're just the fuzzy flowcharts that show you how to think about possibly planning for uncertainty in the future.
[00:09:59] But everybody tends to agree that those are sort of important in theory, but putting them into practice seems to be hard for a lot of people as well.
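The "stop and ask yourself, am I done or not?" checkpoint Brian describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the book: the function name `run_with_checkpoints`, the option names, the scores, and the 0.85 threshold are all hypothetical.

```python
from typing import Callable, List, Tuple


def run_with_checkpoints(
    attempts: List[Tuple[str, Callable[[], float]]],
    good_enough: float,
) -> Tuple[str, float, List[str]]:
    """Try each planned option in order, stopping once results are good enough."""
    log: List[str] = []
    best_name, best_score = "", float("-inf")
    for name, evaluate in attempts:
        score = evaluate()
        log.append(f"{name}: score={score:.2f}")
        if score > best_score:
            best_name, best_score = name, score
        # The checkpoint: stop and ask "are these results good enough?"
        if score >= good_enough:
            log.append("good enough, stop")
            break
        log.append("not good enough, go back and change something")
    return best_name, best_score, log


# Hypothetical contingency plan; each lambda stands in for a real evaluation run.
plan = [
    ("baseline model", lambda: 0.62),
    ("better model", lambda: 0.71),
    ("more data", lambda: 0.86),
    ("retrain", lambda: 0.90),
]
name, score, log = run_with_checkpoints(plan, good_enough=0.85)
for line in log:
    print(line)
```

With these made-up numbers the loop stops at the third option and never runs the fourth, which is the point of writing the check down explicitly rather than working through every option by default.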
[00:10:06] Dr Genevieve Hayes: Mm, yes, I would agree. So when you wrote the book, that was, well, it was published in 2017, so I assume you wrote it 12 to 24 months before that. Obviously generative AI would not have even entered your mind back then because it didn't enter anyone's mind.
[00:10:23] Given the changes that have happened in the data science landscape, and also in your own life, because I'm guessing the work that you're doing today is quite different from the work you were doing back when you wrote your book,
[00:10:38] Dr Brian Godsey: Definitely.
[00:10:39] Dr Genevieve Hayes: Which of the principles of your book do you believe have remained the same, and which do you believe have changed?
[00:10:49] Dr Brian Godsey: Yeah, that's a great question. As you mentioned, I wanted to talk about the data science process in the book and a lot of these things haven't changed. And I think that I had that in mind when I wrote the book, not just because I didn't want the book to be obsolete in 18 months when the latest software tools sort of went out of fashion or became obsolete for some reason.
[00:11:07] Even without generative AI in sight, deep learning models were gaining steam and lots of things were being done sort of easily that weren't very easy before, in terms of achieving certain metrics for machine learning and whatnot. But the process itself has pretty much stayed the same in concept, because, as I mentioned, asking good questions at the beginning of the project and
[00:11:26] asking the data questions, these are things that are still good practice that we need to focus on. So the value of that is still there. I constantly talk about awareness of uncertainty in the book as well. That's always true. Other things that are in the book: statistics has not changed. The way to learn a programming language for the first time and sort of approach data processing and data loading and all that stuff, that really hasn't changed either.
[00:11:51] However, how they balance and become important at various times has changed. And I think that generative AI has shown that certain things can be much easier and much more efficient with generative AI. So compared with the last 10 years, it's just orders of magnitude easier to do certain types of things.
[00:12:07] And so when we're working with projects, even though the underlying principles, as I was saying, haven't changed all that much, because good questions are still good questions and uncertainty is still there, how efficiently we can do things is kind of the biggest change here in the last few years.
[00:12:23] And so certain things that would have taken a lot of work in the past are now much, much easier. And so if we can do this in a matter of minutes instead of hours or days, then that changes our decision, like the mathematics of the decision-making process in a project. And we might choose a totally different path because it's orders of magnitude faster now than it used to be.
[00:12:42] And so to me, that's the biggest difference. I don't feel like I'm doing fundamentally different problems with different values, but I feel like the tools at my disposal just accomplish things at vastly different rates than I could previously. So certain things are so much more efficient that I have to change the way I approach certain aspects of the problem now.
[00:13:01] Dr Genevieve Hayes: Okay, so I'm trying to make this tangible in my own mind and help the listeners to make it tangible in their mind. Would you be able to just outline the steps from your process and how you would look at them in light of AI.
[00:13:16] Dr Brian Godsey: So the way I have the data science process laid out, it's in sort of three big stages, and then each stage has a few steps in it. The first stage is to prepare. The four steps in there are set goals, explore, wrangle, and assess. And so setting goals, that is unchanged.
[00:13:34] The goals of a project are essentially unchanged, although with generative AI we do have somewhat expanded capabilities, so there might be different goals given the capabilities that we have. Exploring, wrangling, and assessing are all factors of the data. So, basically, we're going to be looking at the data to see what's possible, to see what's in there, and to see what we can do with it.
[00:13:55] And so in concept, that doesn't change much either, except we know that generative AI, et cetera, can do a good job of wrangling data and assessing what's there and telling you about it. So that can be a lot more efficient in some cases, depending on the type of data set that you have. But we still have the same questions to ask.
[00:14:11] We still want to find out the same things. What's in the data? What can it do for us? And how do we make that happen basically, but generative AI can help with those steps. The second stage after prepare is to build. And so inside of the build stage is plan, analyze, engineer, optimize, and execute.
[00:14:29] So this is all about building models and then building potentially software products or deliverables and so based on what you want to do, it's kind of tool dependent. And so generative AI can change a lot of things in here, especially around the engineering, especially around the model building.
[00:14:42] If you can use a prebuilt generative AI model, then that might replace a lot of the analysis and the optimization and the engineering that you would do. But that means that the model needs to be able to do that. We need to be able to trust the model to do these tasks, and it's never guaranteed. Models make mistakes.
[00:14:58] We know they hallucinate. They do lots of things. You might tell it to do something and it does something totally different some of the time. And so we need to worry about that. Always be aware of the uncertainty, the possibility of mistakes, and all that stuff. So definitely generative AI changes a lot of the engineering, a lot of the build phase of the data science process.
[00:15:17] But yeah, again, a lot of the questions and a lot of the goals are the same: we want to confirm that inputs and outputs are what we want them to be, and that the model is doing what it's supposed to do to deliver results at the end. The third stage I just call finish, which is deliver, revise, and wrap up.
[00:15:33] And so these three are essentially unchanged from 10 years ago or even more because once we built our product, which could be a product, it could be a report. It could be just some sort of recommendation. It could be anything that's what we deliver.
[00:15:47] And we need to know what adds value to the business or adds value to the research or adds value to whatever this is. It's pretty much unchanged. Once we actually get the feedback from the stakeholders, we might revise it in a very similar fashion. Generative AI doesn't really change much there either.
[00:16:01] And then wrapping up, same thing. We just want to make sure we get all the loose ends tied up and put together. And those are just sort of conceptual stages that they are tool dependent, but yeah, generative AI doesn't change that much.
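For reference, the three stages and their steps as Brian lists them in this answer can be written down as a small Python structure. The stage and step names come from the conversation; the `GENAI_IMPACT` summaries paraphrase his remarks, and the variable and function names are assumptions made purely for illustration.

```python
from typing import Dict, List

# Stages and steps as listed in the conversation.
PROCESS: Dict[str, List[str]] = {
    "prepare": ["set goals", "explore", "wrangle", "assess"],
    "build": ["plan", "analyze", "engineer", "optimize", "execute"],
    "finish": ["deliver", "revise", "wrap up"],
}

# Paraphrased summary of how much generative AI changes each stage.
GENAI_IMPACT: Dict[str, str] = {
    "prepare": "same questions, but wrangling and assessing can be more efficient",
    "build": "changes a lot, but model output still has to be checked",
    "finish": "essentially unchanged",
}


def ordered_steps(process: Dict[str, List[str]]) -> List[str]:
    """Flatten the stages into a single ordered list of 'stage: step' labels."""
    return [f"{stage}: {step}" for stage, steps in process.items() for step in steps]


steps = ordered_steps(PROCESS)
for stage, note in GENAI_IMPACT.items():
    print(f"{stage} ({len(PROCESS[stage])} steps): {note}")
```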
[00:16:13] Dr Genevieve Hayes: If you were to write an updated version of Think Like a Data Scientist today, is there anything you'd add or do differently?
[00:16:21] Dr Brian Godsey: Absolutely. I was avoiding deep learning models and topics for a number of years, and not that I thought they weren't worthwhile, it's just they weren't worth it in the projects I was working on. So I feel like a lot of people were using deep learning for things that I would not have used it for.
[00:16:39] And I didn't use it. I opted not to. So in some sense, I found myself being a contrarian, speaking out against these deep learning models. However, that was a number of years before we had generative AI in any form. And so now that generative AI is there, I would absolutely
[00:16:55] feel it necessary to include a lot of information about how best to use generative AI to either compress or replace certain aspects of the stack, of the process, all that stuff. And so while I basically barely touched on deep learning, even though it was very present in data science
[00:17:11] when I was writing the book, I think that it's much more relevant now to talk about generative AI because of so many things that it can do orders of magnitude faster than anything else before it. And now that I think about it, it's actually not a function of the data science itself.
[00:17:26] It's a function of the process. Deep learning models, as I was writing the book, were state of the art: much better at classifying, much better at predicting, much better at all of the data science-y mathematical concepts, which is great, but most people did not need that.
[00:17:40] Whereas generative AI actually speeds up the development process. So the data science process itself can be sped up or made more efficient by various aspects of using generative AI in that process. So generative AI has had a much larger impact on the data science process than deep learning ever did before.
[00:17:57] And so I think generative AI would be sprinkled throughout the book in various ways, like here's something that could be made better, faster, more efficient than other things if you use generative AI. And so I think that would definitely be a huge change.
[00:18:09] Dr Genevieve Hayes: Well, if you think back to five years ago, at that time I was working in a fairly large government organization, and we didn't have access to the necessary GPUs in order to train deep learning models. And even though we had massive amounts of data, due to privacy constraints and security constraints, we couldn't put that data up into the cloud.
[00:18:34] So straight up, even though we were a very large organization with a lot of data, we couldn't fit deep learning models. It just was not feasible for us. And any smaller organization, what chance did they have? And these generative AI APIs and other similar APIs that connect to pre-built models, they have completely changed the landscape because they allow organizations that can't train their own models to make use of someone else's model.
[00:19:07] Dr Brian Godsey: Definitely, definitely the accessibility factor. Literally, I was a PhD data scientist and mathematician who didn't want to use deep learning models because they were too complicated. I'm not saying it was too complicated for me to use them, but too complicated in the sense that the benefits they give were definitely not worth the much longer development cycle and the much larger costs. I wasn't using them for years, but then as soon as ChatGPT and generative AI show up, literally every consumer who has a computer can use it.
[00:19:37] It's just the accessibility is like many orders of magnitude different. And so, exactly, I agree with your point that this is so much more accessible now for everyone. Even if you don't have the specialized models that you might need for some of the more serious data science applications, there's the accessibility
[00:19:52] and the fact that it's generative as well. And it can apply to lots of generalizable problems. The number of capabilities that these generative AI models have sort of developed innately, you know, reasoning as a factor of learning how the language works, makes it so much more applicable than any deep learning models before that.
[00:20:12] Dr Genevieve Hayes: So, what is the single most important change our listeners could make tomorrow to accelerate their data science impact and results?
[00:20:21] Dr Brian Godsey: One of the things that I see as being a big difference between sort of an okay data scientist and a great one is the ability to get to the bottom of problems. When something goes wrong, being able to figure out where it's coming from, hopefully diagnose it, and then maybe fix it if that's possible.
[00:20:38] And it's not just that we recognize there's a problem. Actually recognizing the problem could be the challenge, like, realizing that something is going wrong could be the challenge itself. And then once you recognize something is going wrong, being able to dig in and get to the bottom of it,
[00:20:51] that skill is, to me, incredibly valuable and sort of underrated. And so I'd just ask people to focus on that, especially in the age of generative AI and other models that are truly black boxes, where we don't know much at all about what's going on in there, but there are ways to poke and prod and see what else we can find that might indicate why certain things have gone wrong.
[00:21:11] And so I don't know if it's going to change anybody's career from one day to the next, but really retaining that problem solving mindset of realizing when there's a problem and then being able to really dig down to the ground level. "Stay close to the data" is what a friend of mine used to say. And so that means even with the most complex model, you want to be able to sort of trace the provenance of everything through and get back down to the data.
[00:21:37] And it may turn out that something about the data, or the way you're feeding it into the model, or the way you're treating it, is causing this output problem on the deliverable side. And so I think just really focusing on realizing when there's a problem, being able to get to the bottom of it, and then also working on your skills of being able to break a problem down into pieces and then being able to separate
[00:21:58] facts and what you actually know from the opinions and the assumptions, like that you're hoping the tool works just fine. And maybe it does, until one day it doesn't. But if you don't stop and check to see whether it actually is doing the right thing, you won't realize there's a problem. So that's what I would say: really focus on the problem solving.
[00:22:16] Dr Genevieve Hayes: One of my favorite things to do as a data scientist has always been to try and break the models. So, just come up with anything I can think of that could cause the model to spit out crazy stuff and try and put that into the model.
[00:22:30] Dr Brian Godsey: I feel like lots of data or tech journalists these days are also trying to break all of the generative AI models and see what can they make hallucinations from? What can they get crazy output from? It seems like a lot of people are having a lot of fun with that these days.
[00:22:43] So, yeah, I definitely can't blame you for having that hobby.
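The "try to break the model" habit discussed here can be sketched as a tiny stress-test harness in Python. Everything in it is hypothetical: `toy_model` is a stand-in logistic scorer rather than any real model, and the list of odd inputs is just an example of the kind of values one might throw at it.

```python
import math
from typing import Any, Callable, Dict, List


def toy_model(x: float) -> float:
    """Hypothetical stand-in model: a probability-like score from a logistic curve."""
    return 1.0 / (1.0 + math.exp(-x))


def stress_test(model: Callable[[Any], Any], inputs: List[Any]) -> List[Dict[str, Any]]:
    """Feed odd inputs to the model and record crashes or implausible outputs."""
    findings: List[Dict[str, Any]] = []
    for value in inputs:
        try:
            output = model(value)
        except Exception as exc:
            findings.append({"input": value, "problem": type(exc).__name__})
            continue
        # A probability-like score should be a real number between 0 and 1.
        if not isinstance(output, float) or math.isnan(output) or not 0.0 <= output <= 1.0:
            findings.append({"input": value, "problem": f"bad output {output!r}"})
    return findings


# Hypothetical edge cases: extremes, a missing value, and wrong types.
odd_inputs = [0.0, 1e6, -1e6, float("nan"), "seven", None]
findings = stress_test(toy_model, odd_inputs)
for finding in findings:
    print(finding)
```

Each finding keeps the input that caused it, so a surprising output can be traced back to the data that produced it.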
[00:22:46] Dr Genevieve Hayes: Oh yeah. So for listeners who want to get in contact with you, what can they do?
[00:22:51] Dr Brian Godsey: So I am at briangodsey.com. That's just B-R-I-A-N-G-O-D-S-E-Y dot com. And then I'm just briangodsey, all one word, on most social media these days: LinkedIn, and I'm now on Bluesky. And I write on Medium sometimes, and I usually post those on LinkedIn as well.
[00:23:11] And of course, I'm working on some sort of new projects these days, but the book is still out there. So Think Like a Data Scientist is still available and it's as timeless as I sort of hoped it would be in some sense. But yeah, I think that's about it.
[00:23:27] Dr Genevieve Hayes: And there you have it, another value packed episode to help turn your data skills into serious clout, cash, and career freedom. If you enjoyed this episode, why not make it a double? Next week, catch Brian's Value Boost, a five minute episode where he shares one powerful tip for getting real results, real fast.
[00:23:50] Make sure you're subscribed so you don't miss it. Thanks for joining me today, Brian.
[00:23:55] Dr Brian Godsey: Thank you for having me. It's been a lot of fun.
[00:23:57] Dr Genevieve Hayes: And for those in the audience, thanks for listening. I'm Dr. Genevieve Hayes, and this has been Value Driven Data Science.