Episode 47: Leveraging Causal Inference to Drive Business Value in Data Science
[00:00:00] Dr Genevieve Hayes: Hello and welcome to Value Driven Data Science, brought to you by Genevieve Hayes Consulting. I'm Dr. Genevieve Hayes, and today I'm joined by Joanne Rodrigues to discuss how techniques drawn from the social sciences can be combined with data science techniques to give data scientists the ability to understand and change consumer behavior at scale.
[00:00:24] Joanne is an experienced data scientist with master's degrees in mathematics, political science, and demography. She's the author of Product Analytics: Applied Data Science Techniques for Actionable Consumer Insights, and the founder of the health technology company clinicpricecheck.com. Joanne, welcome to the show.
[00:00:47] Joanne Rodrigues: Hi, it's nice to be here.
[00:00:49] Dr Genevieve Hayes: For most people, data science is synonymous with machine learning, and many see the role of the data scientist as simply being to build predictive models. Yet prediction can only get you so far. Predicting what will happen next is great, but what good is knowing the future if you don't know how to change it?
[00:01:10] And that's where causal analytics can help. Yet causal analytics is rarely taught as part of traditional prediction centric data science training. Where it is taught though, is in the social sciences. So by drawing causal techniques from the social sciences and combining them with what they already know, data scientists have the potential to generate actionable business insights in ways they never previously could.
[00:01:37] This idea of combining traditional data science techniques such as machine learning with causal inference techniques drawn from the social sciences is the basis for your book Product Analytics. And Joanne, I'll just say, I think this is one of the most original takes on data science I've ever come across.
[00:01:57] And I highly recommend this book to any of our listeners. But one thing I was wondering, your educational background is in the social sciences, but most social scientists I've met throughout my life have typically spent their whole careers working within that social science domain.
[00:02:18] How did you come to combine social science techniques with data science in the first place? Thanks.
[00:02:24] Joanne Rodrigues: So while I was in grad school at UC Berkeley, the wonderful thing about Berkeley is that there are so many departments and you can go to any department. So I spent a lot of time in the computer science building, in demography, in all of these different areas, probably not working on my PhD, which is one reason I didn't finish my PhD. But in going to some of the classes, some of the stats classes, I realized that there were these amazing tools out there to really try to understand social behavior, but they weren't really integrated into the social sciences at all.
[00:02:59] And then, on the other hand, social scientists had developed all of these toolkits around creating inference for social processes that were just not integrated with all these interesting tools that we were seeing in CS and in stats. And I think at that time, so this was in the early 2010s, data science was just becoming a thing.
[00:03:22] Right. And so you were seeing all of these computer scientists being like, oh, there's all this data on the internet and we can do X, Y, and Z, we can predict some random thing. And I was just like, it doesn't make any sense to predict that. They were predicting, like, let's rank news articles based on... it just didn't fit in with a paradigm that really made any sense.
[00:03:44] They were just looking for applications. And I think that's when I realized, oh my gosh, these two different ways of approaching or thinking about problems really need to be integrated better if we're going to get the best results from both machine learning and, to a large extent, causal inference, because machine learning does show what is sort of missing or different about causal inference.
[00:04:08] And not just machine learning, I mean predictive inference in general, because there are forecasting methods and other types of predictive modeling.
[00:04:15] Dr Genevieve Hayes: before we go too far in this episode, we should probably come up with a definition of what we mean by causal inference and predictive inference, because I remember the first time I was speaking to someone, a former guest on this podcast about causal inference, I sort of knew what he was talking about, but I had to then go off and look it up in a textbook.
[00:04:35] So when you talk about causal inference, what do you mean?
[00:04:39] Joanne Rodrigues: Yeah, so when I talk about causal inference, I really mean understanding specifically what causes X, right? Or what causes some outcome? So if we think about it in the experimental context, it's isolating a specific variable and understanding what the effect of that variable is on a particular outcome that we're interested in.
[00:05:01] Whereas predictive inference is really: can we predict an outcome? And then I also think there are two additional kinds; you can also have comparative and observational, right? A lot of the data that we see is kind of just observational, like, oh, there were 25 users on our website yesterday. And there can be observational inferences, like in the development of metrics, or comparative inferences, like the comparison of two metrics over time or something like that.
[00:05:29] Right? And then I think of causal and predictive as kind of the higher level, building on your comparative and observational, and they're kind of separate ways of thinking about, or looking at, data.
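For listeners who want a formal anchor for the distinction Joanne draws here, the potential-outcomes notation below is one standard way to write it; it is generic textbook notation rather than anything quoted from the episode or the book.

```latex
% Causal target: the average treatment effect, comparing potential outcomes
% Y(1) (treated) and Y(0) (untreated) for the same population
\mathrm{ATE} = \mathbb{E}[\,Y(1) - Y(0)\,]

% Predictive target: a model approximating the conditional expectation of the
% outcome given observed features, with no claim about what happens if we intervene
\hat{f}(x) \approx \mathbb{E}[\,Y \mid X = x\,]
```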
[00:05:44] Dr Genevieve Hayes: Yeah, what you've just said there reminds me of a section at the start of your book where you talk about the different types of insights you can have: predictive, causal, observational, and comparative. And that really hit home to me, because I read your book shortly after I'd read
[00:06:02] Effective Data Storytelling by Brent Dykes. And he also talks about insights in that, and many of the insights he's talking about are observational insights. And then reading your book after that made me realise, okay, yes, observational insights can be useful, but you have to be careful about drawing causal conclusions from observational insights when you haven't actually proven that by an experiment.
[00:06:32] And I also read Lean Analytics, which follows on from The Lean Startup by Eric Ries, and that was also talking about how you have to be careful about pseudoscience, which is the observational insights that people turn into causal insights. And it was interesting reading all three of these books, one after the other, because it created this whole picture for me about, oh yeah, causality is really important.
[00:06:59] Joanne Rodrigues: Yeah, and I think the human brain kind of naturally goes to causal inference, right? We're always just like, oh, well, this happened because that happened. We always kind of think in these causal chains, like, oh, I guess I ate too much at lunch,
[00:07:15] so now I don't feel well. We're always sort of thinking at that level. And so I think it's natural for us to then bring those paradigms to whatever we're looking at. And you see this in almost every workplace, and I'm sure in a variety of other contexts, where people are just like, oh yeah, of course,
[00:07:31] we see this number is up, so that explains why that other number is up, when really it's just not related at all.
[00:07:40] Dr Genevieve Hayes: Well, if you look at human superstition, obviously those were badly drawn causal inferences. Like if a black cat crosses your path, then bad things will happen. Obviously that happened to someone at some point.
[00:07:54] Joanne Rodrigues: Yeah, absolutely. And maybe there are some good superstitious things, like, I don't know, maybe around germs. When you read about ancient societies, sometimes they segregated women who had just given birth, and I was like, oh, that's probably because they understood that bad things happen, but it became this kind of superstitious paradigm.
[00:08:13] I can't remember which culture this was, but essentially they would segregate women for a period of time right after they had given birth. And that makes a lot of sense, right, now that we understand the causal relationships with germs and all of these different things.
[00:08:30] But before that was understood, it was explained away by superstition or by myths.
[00:08:37] Dr Genevieve Hayes: Well, the superstition that I always follow is don't walk under ladders. Not because I think it's bad luck, but because there's a good chance that there's someone with heavy tools at the top of that ladder that they might accidentally drop. And every so often I've seen people drop things from ladders and if you're under it, it's going to hit you on the head.
[00:08:57] Joanne Rodrigues: Yeah, I think that's a great example of a clear causal relationship there.
[00:09:03] Dr Genevieve Hayes: But going back to what you were saying about segregation of women after birth, there was this gentleman, I can't remember his first name, Semmelweis. He was a Hungarian obstetrician who was working in a maternity hospital in Austria. This was prior to the discovery of germ theory, and in this hospital there were two clinics.
[00:09:27] There was a clinic where doctors delivered the babies and a clinic where midwives delivered the babies, and the mortality rates were much higher for the doctors' clinic than for the midwives' clinic. So straight up, you've got a perfect experiment here: you've got two groups. And he basically realized that it was germs causing the problem, because in the doctors' clinic the doctors did autopsies in the morning before they delivered babies, and they didn't wash their hands.
[00:10:02] The midwives weren't dissecting dead bodies, so as a result they didn't have these horrible germs on their hands, and the doctors' clinic was much more dangerous than the midwives' clinic. He basically tried to get the doctors to wash their hands before they delivered the babies, and it worked for a while and
[00:10:26] brought the mortality rate right down. And then the doctors didn't like the idea that they were doing something wrong and fired him, etc. But that's a perfect example of causal inference, I think.
[00:10:39] Joanne Rodrigues: Absolutely. And I think that for the variables that are most causal, that are just extremely causal, you can kind of see them in the real world. And this is true for smoking and cancer as well, right? Originally there wasn't a way to really sanction an experiment for smokers and non-smokers.
[00:11:00] And so when statisticians were looking at it, they were like, why are chimney sweeps 200 times more likely to die from lung cancer than other occupations? So with certain things that are extremely causal, where there's a clear mechanism, like with germ theory or with smoking and lung cancer, sometimes you can pull out those causal effects.
[00:11:23] But I think the most difficult causal effects are the ones that are only mildly causal, or the variables that are somewhat more causal. I guess the way I think about it is that for social processes, a lot of variables have a mildly causal effect, and then there are a few variables that are just more causal that you can potentially alter or change a little bit.
[00:11:47] And then, of course, for some processes you have these extremely causal variables, but they're not really rampant. There are a few things where it's really obvious and kind of everybody knows, but generally with the processes that you're studying, you don't see these smoking-
[00:12:02] causes-cancer kind of relationships, right? You see more like, oh, there are all of these different factors in what's going on, and how do we tease apart which ones are a little bit more causally related than others? Or is there a spurious relationship going on with five of these and three of those?
[00:12:22] Dr Genevieve Hayes: When did you first start implementing these causal techniques in a data science environment? Were you first hired as a data scientist or were you first hired as a social scientist?
[00:12:33] Joanne Rodrigues: I was first hired as a data scientist. So this was very early on, when there weren't very many data scientists. And I worked in Silicon Valley, so everyone was like, oh, algorithms, algorithms, algorithms. And they had hired a lot of physicists to run their early data science models.
[00:12:49] And these were for things like selling women's clothing, right? And at that time there wasn't this belief that, oh, you should have domain knowledge. It was just like, the algorithm is right: we just apply the best, most sophisticated methods and that will get us the right answer.
[00:13:06] And that was kind of the belief system. I think that's changed a little bit; now you see domain-specific data scientists and domain-specific models and all of this stuff. So it's definitely improved from where it started.
[00:13:19] But I do think there's still this hype around the most advanced methods. And if you really look at data science or machine learning tools, they haven't changed a whole hell of a lot since many of them were developed, like the perceptron. From a technical perspective, yes, there have been marginal, delta changes, but fundamentally there haven't been huge changes in the underlying algorithms themselves.
[00:13:45] Dr Genevieve Hayes: Most of the algorithms that we commonly use, even things that are sophisticated like neural networks, date back to the mid-20th century. They're not actually anything new.
[00:13:58] Joanne Rodrigues: Absolutely. And I don't know if this is something that I discussed in depth in my book, but while I was at Berkeley, in the computer science department at that time, there were all of these professors who had developed some of those original models, and they shared why they developed them.
[00:14:15] What were the problems they were interested in? And a lot of the problems were image classification and digit classification and all of these different things, which have really clear outcomes, right? A lot of these problems are not like social processes or these more convoluted things.
[00:14:31] And so really, to some extent there's been a little bit in terms of fuzzy algorithms and some of these things to think about more convoluted outcomes, but there hasn't been that thought process for the generation of algorithms related to these more convoluted outcomes or convoluted processes.
[00:14:50] So that's just something that I recognized while sitting through those classes.
[00:14:56] Dr Genevieve Hayes: That point you raised about social processes, that was something that had never struck me before reading your book, but once I read it, it was like, oh my God, that's just so obvious, how was I blind to it? Because when you read your standard machine learning textbooks, every single textbook uses exactly the same datasets.
[00:15:17] So you've got your iris dataset, your cats and dogs dataset, your alcohol content of wine dataset, sentiment analysis of news articles. Everyone's seen all those datasets; they're built into scikit-learn in Python. And I used to wonder, why is it that everything works out so perfectly for all of these datasets in the textbook?
[00:15:39] And then you go off and get a real job and try and apply them in that job, and nothing works as nicely as in those prepackaged datasets. It was only when I read your book that I realized all of those are based on physical processes or physical relationships, but businesses aren't interested in those physical relationships.
[00:16:02] They're interested in social relationships.
[00:16:05] Joanne Rodrigues: Yeah, absolutely. I think the heart of commerce, the heart of most of the data out there, the terabytes of data, is social interaction between individual users. And just like you, when I first encountered that in the workforce, I was so surprised that people were treating these online interactions
[00:16:22] like they were real social interactions. When you're looking at the data as a data scientist, you just don't see that, right? You don't see, oh, people are actually making friends on these applications. People are even dating off these applications.
[00:16:36] These are very real social experiences for people, but because you're just looking at it as data, you're just like, what is this? Because when we have social interactions, we don't see data coming out of that, so you never make that connection. But I think that's very, very true.
[00:16:52] And obviously the way that business works is through relationships, through sales, through a lot of non-logical processes.
[00:16:59] Dr Genevieve Hayes: Yeah. We're not trying to predict what species an iris flower is based on the dimensions of its petals, right?
[00:17:07] Joanne Rodrigues: And I think that the social sciences have been plagued with this issue too. So I don't know if you've heard of the Quine-McCluskey algorithm?
[00:17:14] Dr Genevieve Hayes: No, I haven't.
[00:17:15] Joanne Rodrigues: So this was used in an area of sociology. The Quine-McCluskey algorithm was actually developed for looking at circuits and circuit systems, right?
[00:17:26] You can think of circuits as either on or off, and at the end of the circuit you maybe want to turn on a light bulb or something. So they want to figure out which circuits you can turn off, or what the most parsimonious or minimal circuit is such that you can still turn on the light bulb at the end, or whatever
[00:17:40] it may be, right? So they took this algorithm to kind of define social processes. And the thing about circuits is that circuits are directly causal. If you turn off a circuit somewhere, and that's needed to turn on the light bulb at the end, the light bulb is not going to turn on.
[00:17:58] So they basically applied this perfectly causal system to the real world, and then found out this doesn't work, this doesn't make any sense, right?
[00:18:07] So I think the difficulty here is that we're looking at a system that's perfectly causal, but in social systems you have many, many more elements, you have randomness throughout the system, and so it's very hard to take this causal system and place it upon a social system.
[00:18:28] And that's kind of what they found out: you're falsely thinking things are causally related when they're not, they're just correlated, and changing them won't actually cause something or not cause something. So you can think of an outcome, like, let's say we're trying to figure out what leads to longer life expectancy. Let's say one of the causal links is eating healthy or something, right? Which we think is probably causal, but there are other things, like a car accident where someone just randomly dies; there's randomness in the system.
[00:19:02] So maybe eating healthy on average leads to certain outcomes, but it's not this direct one-to-one, like, oh, if you eat healthy, you will live X number of years, or whatever it may be.
[00:19:12] If we have a very clear, directed causal process, then it's easy to say, oh, if we remove this, or we pull this lever, then this will happen, right? We don't need experimentation in that case.
[00:19:26] But when we have these complex, convoluted processes, we need experimentation to deal with some of these issues: all of the different variables, spurious correlations, all of those things, essentially.
[00:19:41] Dr Genevieve Hayes: Having done training as both a data scientist and a statistician, one of the things that was really hammered into me during my training was that whole correlation-does-not-equal-causation mantra. And I teach a data science class and I have my students basically reciting that as well. Anyone who's taken one of those data science classes can give all those examples of spurious correlations, things like the number of movies Nicolas Cage has made correlating with the annual number of deaths by drowning in the U.S. for some period of time.
[00:20:19] And clearly, Nicolas Cage isn't going around drowning people every time he makes a movie. But although you have that drummed into you, I don't think it's ever spelt out to you in those classes what the implications of that are. And the thing that only just clicked in my mind recently was that supervised machine learning is a correlation-based technique.
[00:20:46] And so because correlation doesn't equal causation, just because in your supervised learning model, when X goes up, Y goes up, you can't draw the conclusion that X causes that movement in Y, because you've got a correlation-based model, not a causality-based model.
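To make this point concrete, here is a minimal Python sketch with entirely simulated data and made-up variable names: a feature that merely shares a hidden common cause with the outcome still earns a strong coefficient and good fit in a supervised model, even though intervening on it would change nothing.

```python
# Simulated illustration: a purely correlational feature predicts well.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 5_000
z = rng.normal(size=n)               # hidden common cause (never observed)
x = 2 * z + rng.normal(size=n)       # observed feature, NOT causal for y
y = 3 * z + rng.normal(size=n)       # outcome, driven only by z

model = LinearRegression().fit(x.reshape(-1, 1), y)
print(f"R^2 on x alone: {model.score(x.reshape(-1, 1), y):.2f}")  # around 0.7
print(f"coefficient on x: {model.coef_[0]:.2f}")                  # clearly positive

# Intervening on x would not move y at all, because y depends only on z:
# the model's predictive success here is pure correlation.
```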
[00:21:09] Joanne Rodrigues: Yeah, absolutely. I think that's the heart of it. You can have these great spurious relationships that predict future outcomes, but that doesn't mean that if you change those things it will change the outcome. I think that's the fundamental difference.
[00:21:24] And you can think of it especially with social things; a lot of things just go together. You can think of earnings and, let's say, financial health. They're probably very highly related, obviously, but they may not truly be causal.
[00:21:40] Something else may be driving both earnings and great financial health, for instance education or motivation or whatever those underlying factors are. So in the real world, I think you can often find these things that track each other over time but might only be slightly causally related;
[00:21:59] in this case, probably slightly causally related, because obviously if you have higher earnings you probably have a better chance at good financial health, but not necessarily as large a causal driver as one may think.
[00:22:12] Dr Genevieve Hayes: You have the examples of the megastars, you know, football players, movie stars, et cetera, who are earning more than either you or I could ever imagine having in a lifetime, and who are broke.
[00:22:24] Joanne Rodrigues: So I think that oftentimes, because they sound so interrelated, people think, oh yeah, they're super causally related. And I think that's where you see the most mistakes, where it's like, oh, that's possible. And then they tell that to their boss and they're like, oh yeah, now we've understood the whole system, guys.
[00:22:43] Dr Genevieve Hayes: I was preparing an example of that for a class I was teaching, and for a period of time there was a correlation between the number of movies Disney releases each year and the divorce rate in the UK. And I actually put that into ChatGPT and asked, okay, provide me with a plausible explanation for why this is true, just to see,
[00:23:12] can you come up with a story? And ChatGPT is like, yeah, this is probably a spurious correlation. And I'm like, yeah, yeah, I know it's a spurious correlation, but just humor me here, come up with an example. And it actually came up with a pretty good explanation: the more movies Disney releases each year, the more strain it puts on relationships, because the parents feel that they have to take the kids to see the movie and then they have to buy all the tie-in toys and so on.
[00:23:42] And as a result, that financial strain causes strain on the relationship, so it results in more divorces. And it's like, okay, I can go with that.
[00:23:53] Joanne Rodrigues: Yeah, I think that's very true. You can always build the causal mechanism: this happens, and then here's what happens next, like a causal chain, right? But I think it also offers you the chance to test that causal chain. When you think about it that way, you can think about whether there are any experiments you can create so you can look at whether X actually leads to Y, because there are intermediate connections.
[00:24:21] So that's another way in which, yes, for anything you can probably come up with a nice long causal chain, but it's also useful to think that way, because then you can test potential causal relationships. But yes, I totally understand what you're saying. It's very easy to build these very convoluted causal chains, and people do it all the time,
[00:24:46] especially in their own lives, I think.
[00:24:48] Dr Genevieve Hayes: And once you come up with a plausible story, you'll hang on to it because yeah, of course the parents are spending too much on Disney toys and it's causing their divorces. What else could be causing it?
[00:25:01] Joanne Rodrigues: Yeah, I think there is something to plausibility, and then also, is the story memorable? There are all of these cognitive biases that we have that lead us to believe certain things over others, and they may completely belie reality,
[00:25:15] but it sounds like, oh, that could make sense, oh yeah, of course, I've seen that in my own life. All of these different biases start to come out when you hear a specific story.
[00:25:26] Dr Genevieve Hayes: So one of the comments that you make in your book is that causal inference is much harder to conduct than predictive inference and is more appropriate for one-off cases and for overarching models. What are some of the challenges involved in causal inference that make it so much harder to conduct than predictive inference?
[00:25:47] Joanne Rodrigues: I think the core difference is that, to a large extent, a lot more thinking has to go into causal inference, because it's all driven by design. With a lot of the techniques, you have to think about how we create this experimental framework from observational data. And observational data is data that has not been randomized, non-experimental data, which is the majority of the data that we have, essentially.
[00:26:12] And so it forces you to really think through: can we operationalize randomness in our environment somewhere? Can we think about how we can create a natural experiment, or find some pseudo-randomization that we can operationalize to understand this relationship?
[00:26:28] So it really forces you to think and come up with these unique designs. The only exception to that is probably the idea of propensity score matching, where you're kind of taking anything; but oftentimes with propensity score matching, or GenMatch, there are also other matching algorithms that improve your matching between treatment and control groups.
[00:26:48] But still, it's very hard to get balance, where your treatment and control look similar. Even if you don't get balance, you can figure out, hey, these are the variables where we're unable to get balance. For some reason, there's strong self-selection on these particular variables, where there isn't a lot of coverage in the treatment group or in the control group,
[00:27:10] and where we're losing the ability to really create comparable treatment and control groups.
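As a rough illustration of the matching-and-balance idea described above, here is a minimal propensity score matching sketch in Python. The simulated dataset, the greedy one-to-one matching, and the standardized-mean-difference balance check are illustrative assumptions, not a workflow taken from the book.

```python
# Minimal propensity score matching sketch on a simulated observational dataset
# with self-selection into treatment driven by two confounders.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 2_000
age = rng.normal(40, 10, n)
income = rng.normal(50, 15, n)
# Older, higher-income users are more likely to opt in to the "treatment".
p_treat = 1 / (1 + np.exp(-(-0.5 + 0.05 * (age - 40) + 0.03 * (income - 50))))
treated = rng.binomial(1, p_treat)
y = 2.0 * treated + 0.1 * age + 0.05 * income + rng.normal(0, 1, n)
df = pd.DataFrame({"age": age, "income": income, "treated": treated, "y": y})

# 1. Estimate propensity scores from the confounders.
X = df[["age", "income"]]
df["ps"] = LogisticRegression().fit(X, df["treated"]).predict_proba(X)[:, 1]

# 2. Greedy 1:1 nearest-neighbour matching on the propensity score,
#    without replacement.
treat = df[df["treated"] == 1]
ctrl = df[df["treated"] == 0].copy()
match_idx = []
for _, row in treat.iterrows():
    nearest = (ctrl["ps"] - row["ps"]).abs().idxmin()
    match_idx.append(nearest)
    ctrl = ctrl.drop(nearest)
matched_ctrl = df.loc[match_idx]

# 3. Balance diagnostic: standardized mean difference for each confounder
#    (values near zero suggest the matched groups look comparable).
for col in ["age", "income"]:
    smd = (treat[col].mean() - matched_ctrl[col].mean()) / df[col].std()
    print(f"standardized mean difference, {col}: {smd:.3f}")

# 4. Simple matched estimate of the effect of treatment on the treated.
print("matched ATT estimate:", treat["y"].mean() - matched_ctrl["y"].mean())
```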
[00:27:16] Dr Genevieve Hayes: So you're talking about observational experiments, which are quite different from, say, the standard medical trial experiment that we're all used to, you know, the one where you've got the group that gets the real medical treatment and the one that gets the placebo. With these observational experiments, do you actually set them up before you start collecting the data?
[00:27:40] Or do you just say, we've got this data, how can we turn that into something that we can draw experimental results from?
[00:27:47] Joanne Rodrigues: Yeah, so I want to preface this by saying that, yes, A/B testing and actual experimentation is the gold standard. If you have experimentation and can do experiments, that's what you should do. If you can't, then this is a secondary pass. I think that it can really go either way.
[00:28:04] It's best, obviously, to think through an experimental design and then go and find the data to map to that. But it's also possible that you've collected all this data and then you realize, hey, we can actually exploit this geographic boundary, because it was basically drawn randomly.
[00:28:28] It's a straight line through X, Y, and Z, and on both sides it looks like they mirror each other, so they can act as treatment and control.
[00:28:36] Dr Genevieve Hayes: That's sort of like the maternity hospital with the doctors' wing and the midwives' wing.
[00:28:42] Joanne Rodrigues: Yeah, so when you said that, I was like, oh wow, that's a really interesting way to think about a natural experiment. And as you can see with that, it's not truly randomized, right? There are probably things that also differ
[00:28:55] between the doctors and the midwives: education, training, all of those things. But he was able to kind of remove that from the analysis because, obviously, the midwives had less training. So not really remove it, but deduce that that probably wasn't what was driving the effect.
[00:29:14] Right? So I think that works sometimes, but you can see that there are still variables they weren't able to get balance on. The best would be to have, I guess, two groups of doctors, one that washed their hands and one that didn't, so you could really operationalize that specific variable.
[00:29:34] So in understanding that, he was able to do what humans do best: identify heuristics, figure out patterns, and remove certain variables in his head, which can be done to deduce causal relationships, but it's not this perfect experimental framework.
[00:29:52] It's not even truly a natural experiment in that sense.
[00:29:55] Dr Genevieve Hayes: And I agree with you, A/B testing is the gold standard, but there are some situations where it would be unethical to run a proper A/B test. You can't say, okay, we're going to force these children to be raised in terrible conditions just to conduct an experiment.
[00:30:13] So, yeah.
[00:30:14] Joanne Rodrigues: Yeah. And that's why, in my book, I talked about teen pregnancy, because I was so fascinated by that economics paper where they operationalized natural abortions. It was so interesting for that reason, because of the idea that there can't really be any kind of experimentation around teen pregnancy.
[00:30:31] You can't go to a group of teens and be like, okay, you know... right? I know that sounds really extreme, but there are so many things in this world, especially social things, especially policy-related things, things that really matter for people's trajectories. You can imagine that having a child young is going to really affect their lives,
[00:30:55] right? So it is very important to study that, study the effects, and see if there are ways that we can make it better as a society for teen moms. So yes, absolutely, there are definitely areas in which we cannot have testing, and often the most impactful areas are where it's very difficult to test.
[00:31:15] Dr Genevieve Hayes: Since reading your book, I've gone on to read several other books about causal inference, and those books have tended to be very maths heavy. They focused almost exclusively on the quantitative techniques that underpin causal analytics, which are important if you want to implement it. But one thing I really liked about your book was that although it does provide those quantitative details, there's also a strong emphasis on the importance of having good qualitative frameworks in place before any analysis can begin.
[00:31:50] So why are qualitative frameworks so important when applying causal techniques?
[00:31:55] Joanne Rodrigues: And I wouldn't say just causal inference, right? Even for predictive inference, it's important for us to really understand the domain, understand what we're trying to predict or what we're looking at. What are our treatment variables? What are our outcomes?
[00:32:11] Really having a holistic understanding of what we're studying and why. And so in the beginning of the book, I talk about this within the scientific method; causal inference came out of the scientific method. That's why it's the gold standard. It's what you see in the sciences.
[00:32:26] And that's why the social sciences have tried to pull it in, so you have that lineage. And then within the framework of how you build theories and hypotheses, that whole process, I think whether you're looking at predictive or causal, you always want to map to your business insights;
[00:32:44] that's your North Star. To map to business insights, you need to understand why you're doing what you're doing. What is the broader business question, or what is the broader framework in which this all exists? So it's almost this top-down approach. And I think that once you start with the top-down approach, when you're looking at causal
[00:33:04] analyses, it makes the design more clear: what your treatment variable should be, what the business can really change to drive, let's say, revenue or customer growth. Those are the things that you might want to look at as treatments.
[00:33:19] And then you want to go through: what's your historical observational data? Are there things that you can't put through experiments? Is there anywhere we can operationalize randomness? If we can't, can we try other methods to get at whether these individual factors are causal, and maybe how big that causal effect is?
[00:33:39] So that's how I would go about it: starting from the broader theory and business question, and then coming down. And it's the same thing with predictive inference. I think it's almost more important for prediction, because you want to make sure that your application is truly predictive, so that prediction is the driver;
[00:33:58] you want to be able to predict revenue, or you want to be able to predict how many chairs you need in a school, so you need to have an understanding of what your school population will look like in five years. Those are true predictive questions. But if what you actually want is to drive consumer growth and you're looking to a prediction to figure out what variables are driving consumer growth,
[00:34:20] that's just a bad idea. That's not the right approach. And so I think starting from the top down, starting from the theoretical framework of what your core concepts are, and then how you're operationalizing those concepts with different metrics, what your indicators are,
[00:34:35] having that full social-science paradigm really helps you get at, first of all, what are you actually measuring? What do you have data around? And then how does that relate to your core business?
[00:34:46] Dr Genevieve Hayes: I like that. So it's basically: don't make the same mistakes that data scientists make when they're fitting predictive models, which is running straight to the fun part, the maths. You've got to do that whole pre-work first if you want to get meaningful results at the end.
[00:35:02] Joanne Rodrigues: Yes, absolutely. And I think there are some things that are clearly predictive, maybe like recommendation systems, but a lot of things are in this gray area of, wait, what is the purpose here?
[00:35:15] Dr Genevieve Hayes: You don't include the number of movies Nicolas Cage has made in the last year in your model to predict drownings in the US.
[00:35:23] Joanne Rodrigues: Yeah,
[00:35:23] Dr Genevieve Hayes: So another thing that I wanted to discuss, which I thought was really interesting, is that your book isn't strictly about causal inference, although that is a substantial part of it.
[00:35:37] You do actually also have sections about supervised learning techniques such as, I think it was, linear regression, KNN and support vector machines? I think that was right.
[00:35:51] Joanne Rodrigues: Yeah, I think I go over linear regression, logistic regression, decision trees and support vector machines.
[00:35:57] Dr Genevieve Hayes: All right, okay. So I like that idea of basically using the causal tools that come from the social sciences alongside those AI/ML tools. Could you provide the audience with a bit of background on how these two toolsets can work together?
[00:36:17] Joanne Rodrigues: There's a lot of academic work being done in this space, especially now. I think this is actually one space where we've seen a lot of growth in the last 10 or 15 years. I don't exactly remember when propensity score matching was developed, I think it was in the 80s, but that may be incorrect. But essentially, a lot of the newer algorithms to do matching are machine learning algorithms that look for,
[00:36:45] oh, which control group member looks like this treated member; so it's doing that kind of search. And you also see this with graphical models; causal inference graphical modeling has kind of emerged, and so you also see that application in that space.
[00:37:01] Another application that I talk about in my book is uplift modeling, which is, again, an integration of machine learning techniques as well as causal inference, and that's to look at individual treatment effects. An area in which they're really trying to push forward in causal inference, from a technical perspective, is the idea of trying to get at subpopulation treatment effects, essentially, or individual treatment effects.
[00:37:26] In the current framework with propensity score matching, or even in an experiment, if you think about it, you're getting the average effect. You're looking at the difference between the treatment group and the control group and you're saying, oh, that's the effect of treatment.
[00:37:39] But in reality, you would want to know what the effect is on an individual. So let's say you take a drug, right? Let's say the study tells you, oh, if you take this drug you feel better within seven days. Well, that's the average, right?
[00:37:54] But let's say you have all these different personal factors, maybe certain health issues. And if you take the drug, you don't feel better for 12 days or whatever. So you want to know what that individual treatment effect is. That would be the optimal thing, but that's really hard to get at.
[00:38:09] So uplift modeling is another technique where you're essentially using machine learning to estimate individual treatment effects, which are really subpopulation treatment effects.
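One common way to approximate the subgroup-level effects described here is a two-model ("T-learner") uplift approach: fit separate outcome models on the treated and control arms of an experiment, then score the difference per user. The sketch below uses simulated A/B test data and made-up feature names; it is a generic illustration, not the specific method from the book.

```python
# T-learner uplift sketch: per-user treatment effect estimates from simulated
# randomized experiment data.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(7)
n = 4_000
tenure = rng.uniform(0, 24, n)          # months as a customer (made up)
engagement = rng.uniform(0, 1, n)       # activity score (made up)
treated = rng.binomial(1, 0.5, n)       # randomized assignment

# True effect is larger for low-engagement users: a subpopulation effect.
true_effect = 1.5 * (1 - engagement)
outcome = 0.2 * tenure + 2 * engagement + treated * true_effect + rng.normal(0, 1, n)

X = np.column_stack([tenure, engagement])

# Fit one outcome model per arm, then take the difference of predictions.
m_treat = GradientBoostingRegressor().fit(X[treated == 1], outcome[treated == 1])
m_ctrl = GradientBoostingRegressor().fit(X[treated == 0], outcome[treated == 0])
uplift = m_treat.predict(X) - m_ctrl.predict(X)

# Estimated uplift should be highest where the true effect is largest.
print("mean uplift, low engagement :", uplift[engagement < 0.3].mean())
print("mean uplift, high engagement:", uplift[engagement > 0.7].mean())
```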
[00:38:19] Dr Genevieve Hayes: Does that get used often in practice? Because it sounds very useful.
[00:38:23] Joanne Rodrigues: It really doesn't. I think a lot of these methods don't get used very often, and I think it's because, again, you have to think about the design first. You have to have experimental data, or something like it, to use with uplift modeling, so you need to run the A/B test, or you need to have at least pseudo-experimental data that you're very confident in, to really jump into uplift modeling. And I think that once people run A/B tests, they're like, oh okay, I think we're done, guys. So you don't really see it being used a lot, although it's a pretty powerful technique.
[00:38:59] Dr Genevieve Hayes: Which techniques do you find get used the most in practice?
[00:39:03] Joanne Rodrigues: It's the most obvious techniques, like operationalizing randomness that people find in the world. So, thinking about geographic boundaries, thinking about time, like you were saying. I talk about ITS, the interrupted time series, where there's a change at one point, so you can look at the prior period and then the post period, and I think those are really, really effective.
[00:39:27] If you want to convince an executive and there's a clear effect, something like that can be extremely powerful. Sometimes you get these wonderful things in causal inference where it's just obvious, like smoking causing cancer; it's such a jump in the data that you're just like, wow. It can be clearly stated or graphed. Oftentimes, though,
[00:39:47] you don't necessarily get that. And that's where, and maybe this is not used very much, but I do think propensity score matching and some of the matching algorithms are really useful for understanding where you don't have balance, like why your treatment group doesn't look like your control group.
[00:40:02] Where is selection happening? With all of those things, you can start to get a better picture of the context.
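A bare-bones version of the interrupted time series idea mentioned above is a segmented regression with a level-shift term at the intervention date. The weekly series below is simulated purely for illustration, and the variable names are assumptions.

```python
# Interrupted time series sketch: segmented regression on a simulated weekly
# metric with a known intervention point.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
weeks = np.arange(104)                      # two years of weekly observations
intervention = 60                           # week the change shipped (made up)
post = (weeks >= intervention).astype(float)

# Simulated series: mild upward trend, a +8 jump at the intervention, noise.
y = 50 + 0.1 * weeks + 8 * post + rng.normal(0, 2, weeks.size)

# Design matrix: intercept, pre-existing trend, level shift, post-trend change.
X = sm.add_constant(np.column_stack([
    weeks,                                  # underlying trend
    post,                                   # immediate level change
    post * (weeks - intervention),          # change in slope after intervention
]))
fit = sm.OLS(y, X).fit()
print("estimated level shift:", fit.params[2])   # should land near the true +8
```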
[00:40:09] Dr Genevieve Hayes: For listeners who are just wanting to get started with causal inference, if you had to recommend, say, one to three techniques that would be the best ones for them to focus on, what would they be, just to get the most bang for their buck?
[00:40:26] Joanne Rodrigues: Probably difference-in-differences methodology. Also looking at natural experiments and understanding how to apply an experimental design to non-experimental data, randomness that you find in the real world; that would probably be the first, then difference-in-differences.
[00:40:44] Then I would also suggest propensity score matching. I think the whole idea of understanding how you create treatment and control groups when you don't have randomness helps you understand so much more why randomness is important. Yeah, those three techniques would really be where I would start.
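For the difference-in-differences suggestion, the minimal two-group, two-period version can be run as an ordinary regression with a group-by-post interaction; the interaction coefficient is the DiD estimate. Everything below is simulated for illustration, with made-up variable names.

```python
# Difference-in-differences sketch: two groups, pre/post periods, OLS with an
# interaction term whose coefficient is the DiD estimate of the effect.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 1_000
group = rng.binomial(1, 0.5, n)             # 1 = region that got the campaign
post = rng.binomial(1, 0.5, n)              # 1 = observed after the campaign

# Both groups drift up by 2 in the post period (parallel trends by construction),
# but only the treated group gets the extra +3 campaign effect.
y = 10 + 1.5 * group + 2.0 * post + 3.0 * group * post + rng.normal(0, 1, n)
df = pd.DataFrame({"y": y, "group": group, "post": post})

fit = smf.ols("y ~ group + post + group:post", data=df).fit()
print("DiD estimate:", fit.params["group:post"])   # should be close to 3
```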
[00:41:02] Dr Genevieve Hayes: Many data scientists are probably working in teams where most of the analytics that's been done in that team is predictive analytics or descriptive analytics. So actually starting to use these techniques would involve a marketing element as well, to convince management that this is a good way to go.
[00:41:24] Based on your experience, can you give them any advice on how to actually sell causal inference to the senior management in their organizations?
[00:41:35] Joanne Rodrigues: So I actually disagree with that, in the sense that I think it's easier to sell causal inference insights than it is to sell predictive insights, as long as you're actually saying what they are, because causal inference is about why something is happening, which is just so much more natural for people to understand.
[00:41:56] People just immediately are like, oh yeah. And I think that you can use graphics, like graphical modeling and all of that, so much better with causal inference than you can with predictive. It's hard for me to think of a predictive case in which you can easily demonstrate some clear business insight, but with causal inference, that's very possible.
[00:42:18] So I think some of the clearest causal applications are with time, like I was saying. Having that ITS graph is very powerful, especially for an executive, because they're like, oh, well, you ran this marketing campaign, and prior to it you had this,
[00:42:32] and after it you had this, and in your placebo it was a straight line, right? That's a clear example that anyone can understand. So I think in a lot of ways causal inference is just much more understandable to executives. I think sometimes predictive analytics is pushed off as causal, or explained in these very hyped-up terms, which make executives go, oh yeah, we're doing the latest ML, AI, this generative AI model. But when it really, truly comes to business insights, causal inference definitely has the advantage.
[00:43:05] Dr Genevieve Hayes: What you were saying just then, about how a lot of data scientists and businesses try and present predictive insights as though they're causal, reminds me of a few weeks back when I saw a webinar where a data scientist was doing a demonstration, fitting a predictive model, and then at the end started drawing all these causal conclusions from it.
[00:43:33] I was just horrified. It's like, I can't trust anything this guy says after this, because he clearly just doesn't get the difference between predictive and causal, and what else is he making mistakes on?
[00:43:46] Joanne Rodrigues: Yeah, I think that's rampant. I also think, and this is what keeps me up at night, that predictive modeling can be very discriminatory on so many levels. And it's driving what we're seeing on the internet. It's driving recommendation systems.
[00:44:03] It's driving all of these different things when we engage with technology, and to me that's really scary, because with causal inference you can put in demographic variables, you can put in all of that stuff, and it's generally not discriminatory.
[00:44:19] But when you have predictive algorithms that give different results for one group and different results for another group, that, to me, is very scary. And you see this a lot; predictive algorithms are now being used to price houses. So you can think about certain areas that may be more economically depressed:
[00:44:36] what are they using to create value in those neighborhoods versus other neighborhoods? You can see the myriad effects that this can have on people's lives. And nobody is really checking whether these models are truly discriminatory, and appraisal values can have huge effects when you think about home equity and all of the other things related to that.
[00:44:56] I'm not 100 percent sure how this works in Australia, but in the US there are so many applications of these algorithms that are rampant in law enforcement and all of these different areas that have a huge effect on people's lives.
[00:45:09] And I think a lot of the people who are implementing those algorithms don't understand that it's correlative and not causal. Like you were saying, they're like, oh, this is great, it has perfect prediction, it's meeting this and that, and they're just using these technical justifications when in reality the data probably isn't perfect.
[00:45:27] It's already biased, because it's probably more likely to be collected in certain areas than in others. There are all these underlying data collection biases going on. And then on top of that, you have models that are essentially black boxes, that are not transparent, that are maybe proprietary, coming from companies with underlying incentives that they want to see X, Y, and Z.
[00:45:53] Dr Genevieve Hayes: That's the case with the LLMs that power things like ChatGPT. No one knows what's going on under the hood.
[00:45:59] Joanne Rodrigues: Yeah, and then I think the question fundamentally is, what are the use cases of this? Where is it being used? What is it giving back to us? And how is that affecting our lives? And I think these questions are just starting to surface,
[00:46:14] really.
[00:46:15] Dr Genevieve Hayes: I think any data scientist who is building a model should have to meet directly with at least some of the people who are negatively impacted by that model. Because early on in my career, I was involved in workers' compensation pricing, and it was a generalized linear model, so you could actually see how everything worked.
[00:46:41] And I was the pricing manager, so I was the most senior person in the organization who could actually understand how the model worked. So whenever my manager got a complaint about the premiums, I would always be brought along to the meetings. I had to sit opposite some of these people who, because of this model, were paying this amount extra each year.
[00:47:06] And I heard their stories about how, because my premium went up, that meant we had to lay off this many staff. You just don't see that when you're just looking at data and a model, what the human consequences of building that model are. And when you've sat opposite someone who
[00:47:26] is telling you about how their business is going under, in part because of your work, it at least makes you realize that everything has consequences, even if it just looks like data and models when you're creating it.
[00:47:43] Joanne Rodrigues: I think that's a fantastic example, and I have not really seen that in the U.S. at all. There isn't this push for greater transparency. There isn't really a lot of push to figure out who is most affected by these models. They haven't even taken the first step there. So I hear you, and I think that's a fantastic idea.
[00:48:08] Dr Genevieve Hayes: So what final advice would you give to data scientists looking to create business value from data?
[00:48:14] Joanne Rodrigues: I think the first thing is, like I was saying, top down. A lot of things can be solved by simpler approaches; we don't need the most advanced algorithm for X, Y, and Z unless there's a really strong justification for it. I also think really understanding the domain is often undervalued.
[00:48:36] So I always try to find the most knowledgeable person I can find and just talk to them and try to understand: what are the underlying incentive structures? Why are people doing what they're doing in this specific domain? And the last thing would just be that data science is fundamentally creative;
[00:48:54] it's not truly a science. And I would say even mathematics is, to some extent, creative. I would probably need to fact-check this, but while I was doing my master's in math, somebody told me that the Egyptians actually used a fractional base. So instead of the base 10 that we use, they had a fractional system, and I found that fascinating.
[00:49:18] And that's when I realized, wow, even math, something that we think of as so rigid and a hard science, is fundamentally creative. And I would say that data science is very much the same way. And also, obviously, garbage in, garbage out; you can create almost anything with data.
[00:49:34] So, be responsible.
[00:49:37] Dr Genevieve Hayes: And for listeners who want to learn more about you or get in contact, what can they do?
[00:49:42] Joanne Rodrigues: Yeah, so I respond on LinkedIn. I also have a website, actionabledatascience.com, where you can send me a message. I welcome any kind of communication, any thoughts, any ideas. I love to hear use cases. So definitely reach out.
[00:50:01] Dr Genevieve Hayes: And I encourage readers to read your book, Product Analytics, because it will change the way you think about data science.
[00:50:09] Joanne Rodrigues: awesome.
[00:50:10] Dr Genevieve Hayes: Joanne, thank you for joining me today.
[00:50:13] Joanne Rodrigues: Yeah, thanks for having me on your program. And thanks for asking such insightful questions. And I really enjoyed the discussion and hearing more about your experiences especially around the pricing and around, your experiences as a statistician, as well as, teaching.
[00:50:30] Dr Genevieve Hayes: And for those in the audience, thank you for listening. I'm Dr. Genevieve Hayes, and this has been Value Driven Data Science, brought to you by Genevieve Hayes Consulting.