Episode 101: Why Traditional Statistics Still Matters in the Age of AI
[00:00:00] Dr Genevieve Hayes: Hello and welcome to Value Driven Data Science, where data professionals become strategic experts. I'm your host, Dr Genevieve Hayes, and today I'm joined by Professor Rob Hyndman. Rob is one of the world's most influential applied statisticians and a professor in the Department of Econometrics and Business Statistics at Monash University.
[00:00:25] He has maintained an active statistical consulting practice for over 40 years, published over 200 research papers, co-authored more than 65 R packages and written five books on time series forecasting. He is also a fellow of both the Australian Academy of Science and the Academy of Social Sciences in Australia.
[00:00:49] In this episode, we'll be exploring why traditional statistics still matters in the age of AI, and how the rush to embrace AI and LLMs may be causing data scientists to leave some of the most powerful analytical tools sitting on the shelf. Let's dive in. Rob, welcome to the show.
[00:01:10] Prof Rob Hyndman: Hi, Genevieve. Thanks for having me. Nice to be here.
[00:01:14] Dr Genevieve Hayes: When I started my career, data science didn't exist as a field. I trained as an actuary and statistician, and those were the tools I relied on in my earliest roles. Then around 10 years ago, I started hearing about the wonders of machine learning, and became worried that my traditional training was no longer enough.
[00:01:33] So despite already having a PhD in statistics, I went back and completed a Master's in machine learning. Then came the AI wave: ChatGPT, large language models, generative AI. And that seemed like the next frontier to pursue. Each step felt like it was taking me further from my statistical roots, but then something unexpected happened.
[00:01:56] People started approaching me for projects specifically because of my statistics background: projects with too little data to train a neural network, but where a classical statistical model was a better fit, or where rigorous statistical analysis was the right answer rather than a predictive model. It slowly dawned on me that machine learning and AI weren't the silver bullet
[00:02:17] I'd come to believe they were. Rather than being the next steps in the evolutionary chain beyond statistics, they were merely two additional tools that could exist alongside it. Now, Rob, I came to this realization relatively recently and somewhat by accident, but you never abandoned classical statistics in the first place.
[00:02:40] In a world that's becoming increasingly obsessed with AI and LLMs, what has kept you advocating for traditional statistics?
[00:02:48] Prof Rob Hyndman: Yeah, that's a great question. I don't really see data science as a new thing. I think it's a new name and it has some new dimensions, but back in the 20th century, lots of people did data science. They didn't call it that. They might've called it statistics, or operations research maybe, or some aspects of computer science.
[00:03:09] But they were doing things with data and trying to build tools and algorithms and models that worked with data. So to some extent it got rebadged as data science. And I see that as a very inclusive term that includes all of the things I just described, but also some new sub-disciplines such as machine learning, which is relatively new.
[00:03:29] I see what I do as using whatever tools are available to solve the problems that I come across, either in consulting or in research. And sometimes I will use a tool that's relatively new that involves things that might be considered machine learning or AI. But other times I will try to build a statistical model to address the problem that I see.
[00:03:53] So it's really what's appropriate for the problem at hand, and there's lots and lots of problems where statistical models are still the best tool for the job.
[00:04:03] Dr Genevieve Hayes: Can you give us some examples of some of those cases?
[00:04:05] Prof Rob Hyndman: Okay, so recently I was working with the Australian Academy of Science, and they were interested in forecasting the age structure of the scientific workforce by discipline. So what's the age structure of chemists gonna look like in Australia in 10 years' time, for example? Are we gonna have too many people retiring, and should we be thinking about how to address those sorts of issues?
[00:04:27] So that was the problem that they had, and they had some data on the number of people of different ages over time. I can't just throw that at an LLM and say what's gonna happen, because it has no idea about the structure of the problem, about the way the workforce evolves. So I built a model that took all of that into account: that we have graduations in certain disciplines, that some people who graduate in that discipline will end up working in that discipline.
[00:04:56] But other people will switch disciplines. You've got retirements happening, you've got people switching careers. You've got people dying. And so I built a model that had all of those components in it, and we used data on each of the components and then ran the model forward to see what was going to happen in the future.
[00:05:14] So it was a very structured model, built on the particular problem. And that's the way statistics generally works. You think about what the problem is, you try to write down some equations that describe the situation, and then you try to find some data that will enable you to fit a model like that and use it for whatever purpose.
[00:05:32] It's very different from throwing data at a big LLM and just asking questions.
[00:05:38] Dr Genevieve Hayes: Did you end up using an ARIMA-type forecasting model or some sort of demographic model?
[00:05:45] Prof Rob Hyndman: A combination actually. So we built a demographic model. It's called a functional data model, because it's a function of age, and there's this rich statistics literature on functional data models. Then, to forecast future functions, we decompose the historical functions
[00:06:00] using principal components, and we forecast the scores using ARIMA models. So it's a combination of various statistical techniques to evolve demographic populations over time. It's been used in population forecasting, but not previously in workforce or labor force forecasting. So we modified the model to allow for how the labor force works as distinct from how populations work.
[00:06:25] Dr Genevieve Hayes: That's really cool. So at the same time, you haven't spent your career completely ignoring machine learning. Your research does span both classical and modern approaches, but what does a productive combination of classical statistics and machine learning actually look like in practice?
[00:06:45] Prof Rob Hyndman: So let me give you an example based on the forecasting competitions that have taken place over time. So there's a series of competitions which have been run by a good friend of mine, Spyros Makridakis, who started them when I was in primary school. He's now about 90 years old and still active in Cyprus.
[00:07:05] So in the first competition he collected some data and he asked people to forecast it. This was back in the early 1980s. And so people would post him their forecasts on a floppy disc, and then he published a paper showing who did best. He did another one in the late 1990s, and then,
[00:07:24] maybe about 10 years ago, came the M4 competition, followed by M5 and M6. So there have been six so far. The M3 competition, which was around 1998 to 2000, was the first one where any sort of machine learning was included. There were some neural nets, and they did appallingly badly, just absolutely awful.
[00:07:45] Much, much worse than almost any statistical model. And it was because they were overfitting: it's a model with lots of parameters, and you've given it a single time series, which is a limited set of numbers. You generally can't do that, and it was obvious in the results. By the time that M4 came around,
[00:08:02] which was around 2018 or so, there was much more understanding of when neural nets are gonna work and the amount of data you need for them. And so the neural nets there did much better, but they didn't win. The models that won used a combination of statistics and machine learning.
[00:08:22] So we came second in that competition, and our model fitted lots and lots of statistical methods and produced a bunch of forecasts from a whole range of statistical methods. And then we used machine learning to choose the weights that we would apply to those forecasts. So it was a machine learning driven ensemble of statistical forecasts that came second.
[00:08:46] The model that came first also used a hybrid of some statistical models and some machine learning approaches. And so by that stage, about seven years ago, the combination was what was working best. The M5 competition was based around LightGBM,
[00:09:02] so it was around these boosted tree models, which you would classify as machine learning, not statistics. So over time they've got better. But there are still plenty of problems where you can't just throw an off-the-shelf machine learning model at it and expect it to give you anything good.
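The ensemble idea Rob describes, many statistical forecasts combined with learned weights, can be illustrated with a toy sketch. The actual M4 entries learned weights from series features across thousands of series; here the base methods are three simple forecasters and the weights come from a plain least-squares fit, purely to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)

# A single toy trending series, split into a training part and a
# validation window where the combination weights are learned.
n = 60
y = 10 + 0.5 * np.arange(n) + rng.normal(0, 1.0, n)
train, valid = y[:48], y[48:]
h = len(valid)

def naive_fc(x, h):   # repeat the last observation
    return np.full(h, x[-1])

def mean_fc(x, h):    # repeat the historical mean
    return np.full(h, x.mean())

def drift_fc(x, h):   # extrapolate the average historical slope
    slope = (x[-1] - x[0]) / (len(x) - 1)
    return x[-1] + slope * np.arange(1, h + 1)

# Step 1: produce forecasts from several simple statistical methods.
F = np.column_stack([f(train, h) for f in (naive_fc, mean_fc, drift_fc)])

# Step 2: learn combination weights (least squares here; the real
# competition entries learned weights from series features instead).
w, *_ = np.linalg.lstsq(F, valid, rcond=None)
ensemble = F @ w

# By construction, the least-squares combination does at least as well
# on this window as the best single method, since each method lies in
# the span the weights optimise over.
errors = np.mean((F - valid[:, None]) ** 2, axis=0)
print(np.mean((ensemble - valid) ** 2) <= errors.min() + 1e-9)
```

In practice the weights must be learned on held-out data or from series features to avoid overfitting; fitting them on the evaluation window, as in this sketch, is only for illustration.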
[00:09:17] Dr Genevieve Hayes: It seems like the best approach for many problems is actually a combination of models from different disciplines. So in this case, you got good success by combining statistics and machine learning.
[00:09:31] In your previous answer, you had success in combining forecasting and demographic models.
[00:09:37] Prof Rob Hyndman: I would say you've always gotta think about the problem and use whatever tools can be brought to bear to give you a good answer. Often it's a combination, but not always. The first really successful example of applying a neural net in a forecasting context that I ever saw was at Amazon, around 10 to 12 years ago.
[00:09:57] I was visiting Amazon running forecasting workshops for their Berlin forecasting group, and they were showing me how they forecast all of their products and they had this massive deep learning neural net across their entire range of products. They were using it to predict sales so that they could make sure they had stock available in the right places at the right times.
[00:10:20] And a lot of those products have no history. There's a new widget that they're selling, or a new book that has no historical data. So you can't build a statistical model on that, 'cause there's no data to work with. But if you built a model across all products, then you could use the features of the product.
Okay, is it a book? What genre is it? Who wrote it? Anything at all that you know about the product, you can use as an input into the deep learning network, and then it will find other related products and use those to give you forecasts. So it's an example where a statistical model really can't work, but a neural net, because it's trained across a whole lot of time series simultaneously, can work.
[00:11:04] Dr Genevieve Hayes: So it would recognize that the latest iPhone has a lot of similar characteristics to the previous iPhone model, for example.
[00:11:10] Prof Rob Hyndman: Exactly. Yeah. Or if some company introduces an entirely new phone, it recognizes it's a phone, and it says, we've had phones before, and here's some other examples of phones from independent providers and this is how they've done, so I'm gonna use that data. It doesn't think like that, obviously, but
[00:11:28] anthropomorphically, that's the way we can think about the models.
[00:11:32] Dr Genevieve Hayes: When I first started working as a data professional, my role predominantly involved fitting GLMs and performing statistical data analysis work. And one thing I've noticed over the past decade is that the role of many data professionals has now evolved into either building dashboards or fitting machine learning models.
[00:11:55] And I don't see that statistical data analysis piece so much anymore, which is a shame, because that's what I've always found to be extremely valuable to businesses. What do you think is being lost when data professionals move away from rigorous statistical analysis towards either building dashboards or fitting machine learning models?
[00:12:19] Prof Rob Hyndman: So what you lose is a measure of uncertainty and thinking about the world probabilistically. So in a forecasting context, a statistical model's not just giving you a number for the future, it's giving you a whole probability distribution. So if I'm talking about the electricity demand tomorrow,
[00:12:38] statistical models will give you a probability distribution for that. So you know the range of possible values and how likely those values are gonna be. Whereas an off-the-shelf neural net for that problem will just give you a number. You give it the input data, the temperatures and past demand, and it will give you a number.
[00:12:57] So you've lost the sense of uncertainty. Now, obviously there are some machine learning methods these days that are trying to address that problem. There's quantile neural nets, where you're trying to predict quantiles rather than just the average. There's a whole field called conformal inference. But these are not so widely used.
People still use the single-number point forecast from some big black box machine learning model, and you've got no idea how uncertain your answer is. So when I'm doing consulting work for business, industry and government, I will always give them a prediction interval, or usually several prediction intervals, so they know how unsure I am about the result.
[00:13:39] It's both an insurance against them saying I got it wrong, but it also helps them use the number in decision making. If I give them a forecast with a very tight prediction interval, then they're much more confident about making a decision around that forecast. And if I give them a forecast with a huge prediction interval, basically I'm saying I've got no idea what's gonna happen.
[00:13:59] That will change the types of decisions that get made. So if you lose the estimate of uncertainty, the understanding of variability, then you're really not using the information as well as you can. So that's in a forecasting context. In a historical analysis context, uncertainty is also important in terms of understanding: is what I see statistically significant?
[00:14:24] Is it likely due to just randomness, or is it actually a real effect that I'm seeing? Without a measure of uncertainty, you can't answer that question.
[00:14:33] Dr Genevieve Hayes: What you were just saying about prediction intervals was reminding me of something I was reading in How to Measure Anything by Douglas Hubbard, where he was talking about the value of reducing uncertainty for business decision makers. The framework that he outlines involves understanding what the range of uncertainty is when he starts solving a problem.
[00:14:57] And that might be based on the estimates of subject matter experts, and then using statistical techniques to reduce that prediction interval. And you're right in saying you can't do that with many of these modern machine learning models. But if you're using something like, say, a GLM, you have that at your disposal.
[00:15:22] Prof Rob Hyndman: Absolutely. That's what they're designed to do: measure uncertainty, not just produce a prediction. I guess the other thing here, at least in forecasting, is around the amount of data available. So typically, in a statistical modeling context, if you've got a time series, you build a model for that time series.
[00:15:41] So you train a model just on that data, and then you use it to forecast the future. So the series length is impacting in two ways: it's impacting how good your training is, but it also provides the conditional information that you're using to forecast. And the more information you have, generally the better you're gonna be able to forecast.
[00:16:03] When you use an LLM, typically it's not just trained on that time series. It's usually trained on a whole pile of things, particularly the modern transformer models; they're trained on an enormous amount of data. Even the models that are specific to forecasting are trained on billions of time series. So when you give it your individual series, the one you're interested in, it's not actually using it to train the model.
It's only using it as the conditional information to produce the forecast. So the training itself should be better, but you haven't got more data, so you haven't got more information that you're conditioning on. And if you've got an annual time series of 14 observations, then yeah, it's a relatively long annual series,
[00:16:42] but you've still only got 14 numbers, and it doesn't really matter what you do, you're not gonna get more information than those 14 numbers. So there's a very limited amount of improvement you can get from fitting some fancy model when you've got a limited series of data that you're throwing at it.
[00:16:59] Dr Genevieve Hayes: And then you risk overfitting anyway.
[00:17:02] Prof Rob Hyndman: Yeah. So LLMs in forecasting work best when they're applied to lots of series and where you get a relatively long individual series for forecasting. And if you're not in that situation, you're generally gonna struggle to beat standard statistical methods.
[00:17:20] Dr Genevieve Hayes: When developing their technical skills, many data scientists treat statistics as an afterthought and never bother learning beyond first-year stats: so descriptive stats and maybe a bit of hypothesis testing and linear regression. However, there's so much more to statistics than just that.
[00:17:39] Beyond those basics, what do you consider to be the most important statistical concepts for data scientists to understand if they wanna create business value?
[00:17:49] Prof Rob Hyndman: It's even worse than that. There's often a caricature of what statistics is. I've heard professors in computer science talk about statistics being, oh, it's all linear and Gaussian, as if that's the only thing that happens. That was perhaps the case at the start of the 1900s, but for the last hundred years we've been developing other tools.
[00:18:10] So the most important thing in statistics is the way of thinking about data in a stochastic way. It's about probability distributions, and it's about building a model that understands uncertainty. If data scientists could at least pick up on those ideas, I think they would be much better off.
[00:18:27] Now, it is improving. There are textbooks now about the probabilistic approach to machine learning, for example. And in the top machine learning conferences there's often lots of probabilistic stuff now that there didn't use to be. So it is improving. And in the statistics literature you see a lot more machine learning information as well.
[00:18:47] So if you go to, say, the Journal of the American Statistical Association, arguably the top stats journal in the world, there's often lots of machine learning-type material in there.
[00:18:58] Dr Genevieve Hayes: I know that the universities have to teach descriptive statistics to first year stats students, but the big problem in doing that is it puts a lot of people off actually doing statistics in later years. I remember when I started university, my knowledge of statistics was based on what I'd learned in Year 12, which was: statistics is calculating variances using a calculator. And that was a traumatizing experience, and I never wanted to do statistics ever again.
[00:19:31] Prof Rob Hyndman: It's very boring too. Statistics is a way of thinking. It's a way of thinking about the world, a way of thinking about uncertainty and randomness and probabilities. And it's very powerful. Once you've got that sort of mindset, you approach all sorts of things in life differently.
[00:19:48] Dr Genevieve Hayes: That probabilistic aspect of statistics was actually what got me ultimately interested in it. When we finally did cover that at university, and I was lucky in that there were compulsory units in my degree that made me do it because otherwise, if I'd only had that first year stats experience, I think I would've tapped out very early. In addition to your academic work you've also maintained an active statistical consulting practice for over 40 years. What sorts of problems do organizations bring to you to solve?
[00:20:20] Prof Rob Hyndman: So these days it's usually some kind of forecasting problem, or a statistical problem that they dunno how to solve. Generally, I will only take on a problem if I don't know how to solve it immediately. Like, if it's obvious how to solve the problem, then it's not really interesting to me, because my motivation for doing this type of work is it gives me ideas for research.
[00:20:44] And almost all of my best research has had the genesis of the idea in a consulting project. And so I will take on a job and provide enough of an answer to the organization to help them address the immediate problem. And there's always time pressure, so you don't necessarily give them the final, thought-through solution with all the mathematical rigor, but you give them something that works.
[00:21:08] And then I think about it over time, sometimes over years, and often will come up with an academic paper where the thinking started back in the consulting problem. So that's why I do it: I want to ground my research in things that are useful, rather than just imagining how you might tweak some existing method to do something else, which is a very common thing in the academic world and of no interest to me. I want to work on problems that I can at least see that somebody is gonna be able to use. And so I try to ground it in problems that arise in consulting.
[00:21:43] Dr Genevieve Hayes: What does the final deliverable typically look like? Is it a report telling your client the direction in which they should head, or do you actually produce a fitted model they can use?
[00:21:54] Prof Rob Hyndman: Usually fitted models and forecasts, but it depends on the situation. So one of my very early consulting projects, which became well known, was the Pharmaceutical Benefits Scheme. So back in the early 2000s, the government had underestimated the budget they needed for the Pharmaceutical Benefits Scheme by between half a billion dollars and one billion dollars per year.
[00:22:17] That's a lot of money, even for a government to find. And so they were just looking for someone who knew about forecasting, and called me up and said, we need some help, can you help? So the deliverable in that case was a couple of reports reviewing the way they did it and providing some approaches that would do better. It also included software for them to implement a better approach to forecasting. And instead of under-forecasting by between half and one billion dollars, they went to being plus or minus about 50 million, which was sort of the measure of uncertainty, and which is much more
[00:22:54] achievable for government. And they've used the same model ever since. The idea for the model that I came up with in that project turned into what is now called an ETS model, which is extremely widely used around the world, one of the most widely used automatic forecasting methods. But it started with the PBS consulting project.
[00:23:12] Dr Genevieve Hayes: I know I've come across them. What does ETS stand for again?
[00:23:16] Prof Rob Hyndman: So it stands for Error, Trend, Seasonal, and it's also built on top of exponential smoothing; the letters ETS also appear in "exponential smoothing".
[00:23:25] Dr Genevieve Hayes: Yes. I've used them at some point. I just can't remember what I've used them for. Beyond forecasting specifically, where do you see rigorous statistical thinking making the biggest difference in how organizations use data?
[00:23:39] Prof Rob Hyndman: I dunno that I can answer that question, 'cause I don't do enough work in organizations outside forecasting to know how they work with data. One of the things that I'm currently working on that's not forecasting is anomaly detection, and trying to build a much more principled, statistical way to think about anomalies.
[00:23:58] So traditionally, even statisticians have been very vague. An anomaly is something that looks like it's come from a different distribution than the rest of the data, but that's not really a proper formal definition, and it's hard to operationalize or turn into an algorithm. And machine learners who've tackled the problem have done it from a distance perspective.
[00:24:19] They look at all the distances between observations, and they decide something's an anomaly if it's too far away. But there's not really a probabilistic case made about what "too far" means. So I'm trying to think about that in a more principled, probabilistic way. And I've been writing papers on that, and I'm currently writing a book on it.
[00:24:38] And I hope that will have an impact, that people will then think about this problem in a much more, as I say, principled and probabilistic way, rather than the sort of ad hoc and vague way that people tackle it, because that has implications for things like fraud detection, or spotting
[00:24:54] security issues in servers, or in any kind of security-system situation where you've got interventions happening that you wanna be able to spot.
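A minimal sketch of the probabilistic framing Rob contrasts with distance-based methods: estimate a density for the data, then flag points that are very unlikely under it. The kernel density estimate and the 1% cut-off here are illustrative assumptions, not the method from his papers.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Mostly "normal" 1-D observations, plus a few injected anomalies
# that come from well outside the bulk of the distribution.
data = np.concatenate([rng.normal(0, 1, 500), [6.5, -7.0, 8.2]])

# Probabilistic take: estimate the density of the data, then ask how
# surprising each point is under that density, instead of thresholding
# raw distances. A kernel density estimate stands in for the model
# (note it is fitted on all points, anomalies included).
kde = stats.gaussian_kde(data)
log_dens = kde.logpdf(data)

# Flag the points whose estimated density is in the lowest 1%,
# i.e. points that look like they came from a different distribution.
threshold = np.quantile(log_dens, 0.01)
anomalies = data[log_dens < threshold]
print(np.sort(anomalies))
```

The appeal of this framing is that "too far away" becomes "too improbable", which gives the cut-off a direct probabilistic interpretation rather than an arbitrary distance.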
[00:25:04] Dr Genevieve Hayes: For a data scientist who wants to strengthen their statistical foundations, where should they begin?
[00:25:10] Prof Rob Hyndman: So I would be trying to just read the best books on statistical modeling in that context. As I said, there's a really good one called Probabilistic Machine Learning by a guy called Murphy. So if you want something that comes from a machine learning perspective, but gives you the more probabilistic approach, that's a really good textbook.
It's a fat book, so it'll take you a while to work through it. Or look online around who's working in this space, who's doing probabilistic approaches rather than deterministic or purely point-based approaches.
[00:25:43] Dr Genevieve Hayes: The name that sprung into my mind while you were talking was Nate Silver.
[00:25:47] Prof Rob Hyndman: Yes, Nate Silver did a fantastic job. When he first wrote his book, 20 years ago or so, showing how probabilistic thinking can improve forecasting, he was working on political forecasting, on baseball forecasting and basketball forecasting. And his book also talks about a few other areas. It's a great book called The Signal and the Noise.
[00:26:08] Some of his later work seems to have gone off on a bit of a peculiar tangent, but definitely that book is worth reading.
[00:26:16] Dr Genevieve Hayes: Yeah, I've read that; it's an excellent book. So for listeners who wanna get in contact with you, Rob, what can they do?
[00:26:22] Prof Rob Hyndman: Best place is my website, robjhyndman.com. And I have links there to software, papers, my books. All my books are open access, all the recent ones. So the one that I've been working on for the last 15 years or so, in various editions, is called Forecasting: Principles and Practice.
[00:26:43] There's links on my website. The publisher is OTexts, that's O-T-E-X-T-S dot com. The book is free and open access. There's R editions, there's Python editions, there's editions in multiple languages: Chinese, Japanese, Korean, Italian, Portuguese, Spanish. We have a lot of editions out. I spend a fair bit of time on that.
[00:27:04] The publisher, OTexts, is actually owned by me. I needed to set up my own publisher because I couldn't convince commercial publishers to have open access online books at the same time as having print editions. And I really wanted to do both. So in the end, I just set up my own publisher and did it.
[00:27:25] These days, other publishers do it too. Chapman and Hall/CRC will have free online editions as well as print editions. Springer allows it; they do it now, but they didn't when I started, so I just did it myself.
[00:27:36] Dr Genevieve Hayes: So that's it for today's episode of Value Driven Data Science. But if you want more from Rob, next week you can catch our Value Boost episode, where we explore how selectively giving away your work for free can be one of the most powerful strategies for building authority and influence as a data professional.
[00:27:56] And if you found today's episode useful and think others could benefit, please leave us a rating and review on your favorite podcast platform. That way we'll be able to reach more data scientists just like you. Thanks for joining us today, Rob,
[00:28:11] Prof Rob Hyndman: It was great to be here.
[00:28:13] Dr Genevieve Hayes: and for those in the audience, thanks for listening. I'm Dr Genevieve Hayes, and this has been Value Driven Data Science.