Episode 36: Sequential Decision Problems
[00:00:00] Dr Genevieve Hayes: Hello and welcome to Value Driven Data Science brought to you by Genevieve Hayes Consulting. I'm Dr. Genevieve Hayes and today I'm joined by Professor Warren Powell to discuss sequential decision problems. Warren is the co-founder and chief innovation officer of Optimal Dynamics and a professor emeritus after retiring from Princeton, where he was a faculty member in the Department of Operations Research and Financial Engineering.
[00:00:31] He is also the author of Sequential Decision Analytics and Modeling and Reinforcement Learning and Stochastic Optimization. Warren, welcome to the show.
[00:00:42] Prof Warren Powell: I always appreciate being invited to talk about my favorite topic.
[00:00:46] Dr Genevieve Hayes: I'm looking forward to hearing about it. Decision making is an essential part of everyday life. And one of the main applications of data science is making the decision making process easier. However, most of the time when data scientists build models, it's to make a single decision or to solve a single problem.
[00:01:07] Whereas in real life, decision making is rarely that simple. One way in which the decision making process can become more complicated is in the case of sequential decision problems, which is your area of expertise, Warren. I'm sure you can do a much better job of explaining sequential decision problems than I ever could.
[00:01:29] So to begin with, Warren, can you explain to our listeners what is meant by a sequential decision problem?
[00:01:38] Prof Warren Powell: We all make decisions and time moves forward. It's almost never the case that we make a decision just once and stop and that's it. And it's never made again. Now I had my start in freight transportation, truckload trucking. Think of Uber for freight. Where you have a driver with a truck and he has to move a load from A to B and then he unloads and he has to pick up a new load.
[00:02:03] The difference with Uber is Uber trips might be 15 minutes or 30 minutes. With truckload trucking, it's typically a day, or two days, or three days. In the United States, you can have loads that take four or five days, because they go all the way across the country. And we have to think about which driver to move the load, because we have to think about what he's going to do at the destination.
[00:02:24] All these drivers have homes. We'd like to get them loads that will get them home. The problem is, I don't know what loads will be available at the destination of the load. So I have to make a decision now, under the uncertainty of the decision I'll be making in the future. So I have to assign a driver now.
[00:02:40] That's a decision. Then time passes. He moves the load to the destination. I have to make another decision. So we're constantly making these decisions of assigning drivers to loads over time. Then we wait for new loads to come in and then we make more decisions. This happens pretty much everywhere.
[00:02:57] Every time we choose what's the best path to drive on one day, we're going to make that decision again the next day. Every time I pull money from an ATM, I'll be making that decision again in the future. If I'm in a business and I'm pricing my product, I will have to update that pricing periodically over time.
[00:03:16] If I'm a scientist in a lab, I have to run an experiment, look at the outcome, and then decide the next experiment. If I'm a physician, I'm going to try a new medication with a patient. I have to wait to see how that patient responds. Then I'm going to make another decision about whether to stay with that medication or switch to another one.
[00:03:36] So pretty much all of us make decisions. And every time we make these decisions, maybe we learn something, maybe we change the world, whatever it is, it's not going to be the last time we make the decision.
[00:03:49] And when we make the decision, information will come in after I make the decision that will affect it. Like the food: did I like the food? I have to try the food, but then I have to see if I like it, and then I change what I do. Did I like that restaurant? I go to the restaurant, see how I liked it, maybe I'll come back.
[00:04:06] Pretty much every decision is a sequential decision. I actually struggle to come up with a decision that's not sequential because time never stops.
[00:04:16] Dr Genevieve Hayes: What I find interesting is, I did a statistics degree and I took every statistics subject the university had to offer by the time I finished my undergrad studies, and sequential decision problems didn't come up at all. Do
[00:04:32] you,
[00:04:33] Prof Warren Powell: Every time we make a sequential decision, we're going to do it with our best belief of the world. Now, that's where statistics comes in, because very often, whatever my decision is, I will have: what's the best travel time on the route? How will the drug respond? How do I know? Well, that I'm going to get through statistics.
[00:04:51] I'm going to take past data and run some statistical model to get an estimate of: will that drug work? What's the travel time on the path? How much will that stock respond? How much should I charge for my book? I'm going to use statistics to give me my best estimate, but it's still going to be the case that after I make the decision using my best estimate based on statistics, the real number will then come in and then I'll have to make the decision again.
[00:05:18] Now, sequential decisions, I'm the one who finally settled on the word sequential decision. I didn't invent that term, but it's not the standard term. In the academic community, we'll call it dynamic programming, or stochastic search. There's a community that calls it simulation optimization.
[00:05:36] Multi armed bandit problems, also called optimal learning. Optimal control, a huge field in engineering called optimal control. All of these are fields that study sequential decisions. One of my favorite slides over the last 10 years has been a picture of a jungle, and then I paint it with the front covers of about 15 different books, each of which represents a different field.
[00:05:59] There's 15 distinct fields using eight fundamentally different notational systems, all of them working on sequential decision problems. It really is a classic jungle, and they all speak their own languages.
[00:06:13] They all have their own theorems. They each have their own group of stars who think that they're the most important people in their little subfield.
[00:06:20] So statistics, let's imagine that I'm collecting data and I get to choose the data. Let's say maybe I want to have demand as a function of price and I want to estimate how much to charge for my product, or I want to find the best medication for this patient. I can use statistics. Statistics tends to assume you already have a data set, but let's say I get to have some influence or get to choose what data I collect.
[00:06:44] So I can see how this patient responds to medication by saying, let's try that medication. Or, hey, how does the market respond to price? Well, let's try this price. I'm going to do statistics, but the data in my data set, I choose it, or at least influence it. So you were probably studying problems where you were given a data set from which you had to derive a statistical model.
[00:07:08] All right, but where did that data set come from? And there are so many problems where you get to either choose or influence the data that you observe. Those are sequential decision problems also.
[00:07:20] Dr Genevieve Hayes: I can imagine Bayesian techniques coming into this at some point.
[00:07:25] Prof Warren Powell: Absolutely. I use Bayesian logic a lot, but I also do frequentist work. I sometimes use both at the same time. So, for example, I may have a belief about how a patient responds to a medication. Now, I can't run a hundred different experiments on that one patient. I have got to come to that problem with some Bayesian priors.
[00:07:45] But then I'm estimating how the patient responds, and I'm going to do that with frequentist work. Now, if you go to a Google or Facebook and you have to choose what ad to put up, which gets the most ad clicks, they don't really do Bayesian work.
[00:07:59] There's so many thousands of clicks. That type of work is pretty frequentist, but if I'm a scientist where one experiment might take a week. Okay, I worked with this professor at Princeton in material science. One experiment was a month in the
[00:08:13] lab, one month to get one data point. That has to be Bayesian.
[00:08:18] You've got to come in with some prior knowledge.
[00:08:21] Dr Genevieve Hayes: You've been doing this your whole career. How have you seen the sequential decision analysis space evolve over that time?
[00:08:27] Prof Warren Powell: Well, so right now, everybody who gets trained somewhere in the jungle, they become an expert, they've got their theorems, they've got their papers. To publish in these fields, you've got to publish in one of the sub communities, because that way you get referees who can review work. So, the academic community gets really entrenched.
[00:08:48] Now, it just was the nature of my work, I ran a big lab. And the thing is, you don't really learn this unless you're doing experimental work. You can't just be doing theorem proving. Theorem proving, you don't need all these tools because you're just proving theorems. But if you're doing numerical work on real problems, and I did a lot of work with real companies, I also did a lot of academic work, but I did a variety of problems.
[00:09:08] And after spending about two decades working on problems where, okay, so I called it approximate dynamic programming, the computer scientists at the time called it reinforcement learning, always based on Bellman's equation, and oh my gosh, Bellman's equation. I was the guy with the hammer looking for a nail, and I was really good at looking for nails.
[00:09:28] Now after a while, you're like, okay, I've done enough trucking problems, what else is there? And I got into energy, and then I got into health, and as you get into different areas, well, you know, you could just look at a problem and go, well, Bellman's equation just isn't going to work. I'd rather use this technique.
[00:09:45] And at one point, I realized all at the same time in my lab, I was using a variety of techniques. Back in 2011, I wrote the second edition of my ADP book, and chapter six was a chapter where I stepped back and I said, you know what, there seems to be more than one way to solve these problems. So Bellman's equation, all right, that's a good way.
[00:10:04] And we academics seem to like that, but well, you know, every time you use Google Maps, you're solving a sequential decision problem. So let's say you're doing a long trip. And Google has this wonderful way of getting updates from the traffic and occasionally saying, oh, I'm going to change your trip.
[00:10:21] Now, how are they doing it? They're doing it with something that I call a look ahead policy. They're just solving shortest path problems,
[00:10:27] looking into the future and with updated information.
[00:10:30] They're not using Bellman's equation. Bellman looks ahead in its own way, but the way it does it, it's just not practical for what Google Maps is trying to do. But now let's take an inventory problem where, when the inventory gets below a certain level, you order up to another level.
[00:10:46] Now, inventory problems are one of the favorite applications of Bellman's equation, and I can pretty much guarantee absolutely nobody uses Bellman's equation in practice to solve an inventory problem. Everybody does: oh, the inventory just got low, I've got to order more. Now that's just a simple rule.
[00:11:04] It's like finance with buy low, sell high. I'm not doing anything fancy there, but I have a simple rule and it has two tunable parameters. And so here we have an inventory problem I'm going to solve with a simple rule. I've got Google Maps, where I'm going to do a look ahead policy. I had my trucking problems where I could use Bellman's equation.
[00:11:23] It was very successful. It was successful on those problems, but I would never use Bellman's equation on this inventory problem. So as you work on a diverse enough set of problems, there's a lot of problems where it's like, well, of course, I'm going to solve it this way. And this problem, of course, I'm going to solve it this way.
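To make that order-up-to rule concrete, here is a minimal sketch in Python of an inventory policy with its two tunable parameters, scored by simulating random demand. The demand distribution and the cost figures are invented for illustration; they are not from the episode.

```python
import numpy as np

def order_up_to_policy(inventory, reorder_point, order_up_to):
    """Simple rule: if inventory falls below the reorder point,
    order enough to bring it back up to the order-up-to level."""
    if inventory < reorder_point:
        return order_up_to - inventory
    return 0.0

def average_daily_cost(reorder_point, order_up_to, n_days=365, seed=0):
    """Score one setting of the two tunable parameters by simulation."""
    rng = np.random.default_rng(seed)
    holding_cost, stockout_cost = 1.0, 20.0           # illustrative assumptions
    inventory, total_cost = float(order_up_to), 0.0
    for _ in range(n_days):
        inventory += order_up_to_policy(inventory, reorder_point, order_up_to)
        demand = rng.poisson(10)                       # assumed demand process
        unmet = max(demand - inventory, 0)
        inventory = max(inventory - demand, 0)
        total_cost += holding_cost * inventory + stockout_cost * unmet
    return total_cost / n_days

# "Tuning" here is just searching the two levels against the simulated objective.
best = min(((s, S) for s in range(5, 30) for S in range(20, 60) if s < S),
           key=lambda levels: average_daily_cost(*levels))
print("best (reorder point, order-up-to level):", best)
```

Note that Bellman's equation never appears: the policy is a rule, and all of the work is in tuning its two levels.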
[00:11:41] So it took me a lot of courage to realize that all the different ways of solving problems could be divided into four classes that I call the four classes of policies. And this isn't from a theorem. This is from school of hard knocks, looking around, looking at the literature, looking at what everybody was doing.
[00:11:59] And the first time I stood up in front of an academic audience and I said, I think there's four classes of policies, I welcome people to challenge me. And then I kept doing that over and over and over, and nobody was challenging me. And after a little more thought, I started to realize, you know what? Yeah, I'm pretty sure that these four classes of policies cover everything, you know?
[00:12:21] And finally, last year, I came out with my big book, Reinforcement Learning and Stochastic Optimization. It's 1,100 pages. It's entirely built around the four classes of policies. Now you go back into the jungle and you have all these experts, and pretty much they're all experts in usually one and sometimes two of the classes.
[00:12:41] The academic community is pretty good at three of the four classes. One of the classes nobody in academia knows anything about, widely used in practice, that I call parametric cost function approximation. And CFAs are one of the most powerful methods. And it's what's really used in practice for more complicated problems.
[00:13:02] And I discovered it by watching how our local grid operator would sit on Monday and plan the energy generation for Tuesday. And so they're using a look ahead, just like Google Maps is a look ahead. Theirs was much more complicated, but they had to deal with the uncertainty that generators sometimes fail.
[00:13:20] And they had to plan for tomorrow with the guarantee that if the biggest single generator failed, they wouldn't have an outage. So how did they do that? Well, the academics came roaring in and said, oh my gosh, you have to use this incredibly complicated technique that we've been studying for 70 years. And almost nobody uses it. It's called stochastic programming, and we're very proud of it. There's thousands of papers written every year. Nobody uses it. And there are these grid operators, uniformly all around the world, I'm pretty sure, and in the United States, they all use the same software. They plan for tomorrow deterministically.
[00:13:57] The way they handle uncertainty is the way we all handle uncertainty. We put in fudge factors. So when we use Google Maps to plan a path to a destination, it's 40 minutes away, but I've really got to make that appointment. I'm okay that Google Maps used a deterministic approximation of the network. But when Google tells me it'll take 40 minutes to get there, I'm going to go, you know what, I don't know about that 40 minutes.
[00:14:19] I'm going to add in an extra 15 minutes just to be safe because I really don't want to be late. That's a fudge factor. The problem is nobody thinks of that as stochastic optimization. And I realized what industry was missing was the fancy words and fancy math, but what they were doing was perfectly fine, although they did it in an ad hoc way.
[00:14:39] And the first time I presented this at an energy workshop, one of the grid operators came up and said, Oh my gosh, this is great! Do you have a package? Because they do their tuning in an ad hoc way. The problem is their tuning has a lot of domain knowledge and it's very hard for an algorithm to replace that, but the concept of solving a deterministic approximation and throwing in some tuning parameters, I realized was incredibly powerful.
[00:15:03] I put together a web page. tinyurl.com/cfapolicy is a web page that explains the concept of doing a deterministic approximation but with tunable parameters. Very simple, very powerful,
[00:15:19] and so
[00:15:19] that's the second of my four classes of policies, and that nailed it.
[00:15:23] As soon as I came up with this, I said, okay, that's it, I've got the four classes.
[00:15:27] Dr Genevieve Hayes: So with that one you just described, the tunable CFA policies, are you actually tuning the fudge factor in order to
[00:15:35] optimize it?
[00:15:36] Prof Warren Powell: let me first of all relate back to your background in statistics. I'm pretty sure you have a pretty good command of what a parametric statistical model is. By parametric, it means you have a statistical model, which is an analytic function, and you have parameters that have to be fit.
[00:15:52] Dr Genevieve Hayes: Yep. So like a linear regression model, for example.
[00:15:54] Prof Warren Powell: Yeah, welcome to tunable parameters.
[00:15:58] So linear regression is an example of having a nice mathematical model with tunable parameters, and then you use a training dataset. Now the CFA policy, think of it just like a parametric model. So the first difference is that the underlying model isn't a simple parametric model.
[00:16:15] It might be a linear program. It could be a shortest path problem. And secondly, the objective function is whatever objective function I have for my optimization problem, say minimizing cost or maximizing profits or whatever. You have the objective function of fitting the data; fine, that's your objective function for a statistical model. But over here with linear programming, I've got some objective function that's relevant for my problem, like Google Maps, I want to minimize the travel time. Okay, but I really also want to minimize the risk that I'm going to arrive late. So I throw in this tunable parameter. So tuning parameter: just think parametric statistics, you're tuning parameters.
[00:16:51] In a CFA policy, I'm tuning parameters. We have different objectives, but okay, we both have our own objectives. You have a linear model. I have a linear program. Okay, it doesn't have to be linear, but it's something. It's so close to parametric statistics. It's so similar to parametric statistics. But it's a policy for a sequential decision problem with a different objective and no training data set.
[00:17:15] Dr Genevieve Hayes: Okay. So how do you tune parameters without a training data set?
[00:17:19] Prof Warren Powell: Well, I already have my own objective function. So let's say, for example, that I'm going to take my Google Maps application and I want to minimize travel time. Real world example. I used to do work with a company up in Hartford, Connecticut.
[00:17:33] I live in Princeton, New Jersey, and guess what? In between us is this city called New York City. And I had this nice example where Google Maps says, okay, to get from Hartford down to Princeton, guess what? The shortest path runs right through New York City. Okay? And I'm like, okay, this is crazy. You're going to have me going through New York City at 5 p.m.,
[00:17:52] but there's a path that's about five minutes longer that goes all the way around New York City, and I don't have the risk of New York City congestion. Now, what I have to do is figure out how much I'm willing to pay, or how much I'm willing to penalize. Well, let's say I have a penalty for being late. Let's say, you know, coming home late for dinner, I'm going to put a well defined penalty on that.
[00:18:12] So I have an objective function; anybody solving a linear program has an objective function. They have a model of the problem. Okay, I don't need a training data set, but I do need my own objective, I do need my own model of the problem that I'm going to optimize.
[00:18:28] But, let's say I optimize it, and let's say there's tunable parameters, but I fix those. Now, I have to test the solution in a simulated environment. When you do your parametric statistics, you're testing on a training data set. That training data set is a random sample of what you might observe. When I'm simulating my policy, I have to optimize, get a decision, and then use simulation.
[00:18:54] So I can simulate in the computer, but I can also simulate in the real world. I can actually go make that trip and see whether I arrived late or not.
[00:19:04] And if I arrive late, I'm going to have a penalty. Boom. Bad for me. I twiddle my parameter using the same types of algorithms, by the way, that the machine learning people use to say fit those neural networks.
[00:19:16] A stochastic gradient algorithm might be a perfectly reasonable algorithm.
[00:19:19] So the algorithmic strategy: now, linear regression is kind of a special case, but let's say we have some nonlinear problem and I'm going to use a basic gradient search. It's really a stochastic gradient algorithm. That's a standard method for parametric statistics.
[00:19:34] I'm going to use the same method: I make a decision, implement the decision, then, either in the field or in the computer, I simulate how well that decision worked. I get an answer. It feeds back to my stochastic gradient. I do it all over again.
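Here is a minimal sketch of that tuning loop in Python, assuming a single buffer parameter theta (the "fudge factor" of extra minutes) with an invented travel-time distribution and penalty; the gradient is estimated by finite differences on simulated trips.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_trip(theta, plan_estimate=40.0):
    """One simulated trip: leave theta minutes earlier than the deterministic
    plan suggests, then observe a random realized travel time."""
    travel_time = plan_estimate + rng.gamma(shape=2.0, scale=5.0)   # assumed congestion noise
    lateness = travel_time - plan_estimate - theta                  # minutes late at the appointment
    late_penalty = 10.0 * max(lateness, 0.0)                        # I really don't want to be late
    padding_cost = 0.2 * theta                                      # but padding every trip has a cost too
    return late_penalty + padding_cost

# Tune theta with a stochastic gradient estimated from simulated trips,
# the same kind of algorithm used to fit parametric statistical models.
theta, delta = 0.0, 1.0
for n in range(2000):
    grad = (simulate_trip(theta + delta) - simulate_trip(theta - delta)) / (2.0 * delta)
    theta = max(theta - 0.5 / (1.0 + 0.01 * n) * grad, 0.0)
print(f"tuned fudge factor: about {theta:.0f} extra minutes")
```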
[00:19:49] Dr Genevieve Hayes: So instead of training everything in one go, like you would with a neural network, you come up with an initial solution, test it, modify or retune the parameters to allow for the additional information that you've collected, and then you make that decision again, given the updated information, and so on.
[00:20:12] Prof Warren Powell: But let me go back to a statistics problem, but instead of your batch data set, let's say I'm that doctor again, and you're a new patient who is a type two diabetic being given medication for the first time. There's a medication called metformin, that's the standard go-to medication, and it works about 70 percent of the time. There's about two dozen other medications for lowering blood sugar, but everybody responds to those medications differently. So I now have a statistical problem of estimating how this one patient, a particular patient now, responds.
[00:20:47] So I don't have some big data set for that one patient. All I have is, okay, metformin didn't work. I got 24 others. Maybe I will be slightly Bayesian because I do have an idea how well this tends to work on the broader population, but that still doesn't tell me how it's going to work on this one patient.
[00:21:05] Now I have to start doing trial and error to test out different medications and use my standard statistical tools, but I'm going to get one observation at a time, and it is a sequential decision problem because I have to try the medication and see how it works, but I'm using my standard statistical tools. Now, how do I go about choosing what medication to try?
[00:21:25] Let me tell you a very powerful, very simple method. Let's say I have these 24 medications. And each one's going to lower my blood sugar by some amount that I don't know, but I have a probability distribution. I would like to have the medication that lowers the A1c the most but I have a lot of uncertainty.
[00:21:44] So this one looks best, but this one that doesn't look as good might even be better on this particular patient. Let me tell you a fabulous policy. I'm going to line up all my medications, I'm going to use a point estimate, I'm going to go up theta standard deviations, tunable parameter, theta standard deviations, and then I'm going to pick the medication that looks best by taking the mean plus theta times the standard deviation.
[00:22:09] This is actually a very powerful strategy. But you've got to figure out what theta is. Now I have an optimization problem because I'm looking at all the medications, picking what's best. So it's not a hard optimization problem, it's a sort. Okay, but I'm sorting on this metric that depends on this tunable parameter theta.
[00:22:27] Tuning theta is really quite important. You got to get it right. When you get it right, this policy works unbelievably well. But it's not that easy getting theta correct. Now, you don't tune theta for every patient, okay? So, tuning theta can be done on test data sets and maybe you can do it in a simulator, but it's harder than it looks even inside a simulator.
[00:22:48] So, that's an optimization problem with a tunable parameter. Now, it's not a linear program. It's not a shortest path problem, but it's a sort. So, that falls in this class called a parametric cost function approximation. But again, this is not a panacea. It's one of the four classes. It's fabulous in certain problem settings.
[00:23:09] And by the way, it doesn't use Bellman's equation. So that's a fully sequential decision problem, but I'm not going to use Bellman. I could use Bellman. But it's complicated and hard and messy. And I'd much rather use this much simpler policy, but the price you pay is a tunable parameter.
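A minimal sketch of that "mean plus theta standard deviations" rule in Python, with a lightly Bayesian belief update after each observation. The two dozen medications come from the conversation, but the response distributions, the noise level, and the starting value theta = 2.0 are invented; the episode's point is that theta has to be tuned.

```python
import numpy as np

rng = np.random.default_rng(1)
n_drugs = 24                                        # the two dozen alternatives mentioned above
true_effect = rng.normal(1.0, 0.4, size=n_drugs)    # unknown true response (invented)
noise_sd = 0.3                                      # observation noise (invented)

# Slightly Bayesian start: the same prior belief for every drug.
mean_est = np.full(n_drugs, 1.0)
var_est = np.full(n_drugs, 0.5 ** 2)

theta = 2.0   # the tunable parameter: how many standard deviations of optimism

for trial in range(30):
    # The policy itself is just a sort on mean plus theta times the standard deviation.
    choice = int(np.argmax(mean_est + theta * np.sqrt(var_est)))
    outcome = true_effect[choice] + rng.normal(0.0, noise_sd)   # observe one response
    # Conjugate normal update of the belief about the drug we just tried.
    precision = 1.0 / var_est[choice] + 1.0 / noise_sd ** 2
    mean_est[choice] = (mean_est[choice] / var_est[choice]
                        + outcome / noise_sd ** 2) / precision
    var_est[choice] = 1.0 / precision

print("drug currently believed best:", int(np.argmax(mean_est)))
```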
[00:23:25] I've got another policy in the fourth class that doesn't have a tunable parameter. But it's not so simple. So you have to look at that and say, well, this one's really simple, but I got a tunable parameter. Here's this thing with Bellman's equation. Oh, my gosh, look at the mathematics. Oh, here's another thing.
[00:23:44] By the way, I'm very proud of it. It's called the knowledge gradient. It's a form of look ahead. Same classes, Google Maps. It's a look ahead. It's kind of a fancy look ahead. Okay. And simple versions of it are pretty simple, but more sophisticated versions, you know, you got to sit down and learn some tools.
[00:24:01] In other words, choosing the right policy is not just a matter of saying which one works best. There's going to be the issue of how hard it is to learn computationally, how hard it is to compute. There's a lot of things that you trade off. Like for medical people, transparency is really important.
[00:24:22] Dr Genevieve Hayes: You've mentioned that there are four classes of policies and you've spoken about the parametric cost function approximation. What are the other three classes?
[00:24:32] Prof Warren Powell: Let's start from the simplest: policy function approximations. Simple examples are inventory, when the inventory goes below one level, order up to another level; in finance, it's buy low, sell high. It could be a linear model, it could be a neural network, it could be any function that you use in statistical modeling, any analytic function.
[00:24:55] What the first class doesn't have is an embedded optimization problem. So the other three classes, all three of those classes, there's an embedded optimization problem. The first one is the one that doesn't. And this is the class that we humans get through life with. Okay, if it's cold, wear a coat. Now, once you have the sense of, I've got a choice, there's a lot of decision problems that involve a choice, like what appears to be the best drug, and I said, okay, here's these different drugs.
[00:25:23] I have to sort, but I'll have a tunable parameter in there. So the first class, PFAs, there's always tunable parameters. When the inventory goes below one level, order up to another level, I've got to tune those two levels.
[00:25:36] Dr Genevieve Hayes: So it's just an "if A, then B" statement.
[00:25:39] Prof Warren Powell: Yes. Okay. So it can be a simple statement like that. By the way, it could be a full linear model. I could say, hey, how much water should I release out of the reservoir? Well, that's a function of rainfall, temperature, humidity, various things. I end up with some linear model that I put everything in and it comes back and says, here's how much water to release from the reservoir.
[00:25:59] So think of this as linear regression. It could be linear regression. It could be non linear. It could be neural networks. It could be anything you use in statistics. The only difference is the tuning is not being done with a training dataset. It's being done with some objective function.
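A minimal sketch of that kind of linear-model PFA in Python, using an invented reservoir simulator; the point is that the coefficients are searched against the simulated objective rather than fitted to a training data set.

```python
import numpy as np

rng = np.random.default_rng(7)

def release_policy(theta, rainfall, temperature):
    """PFA: an analytic function (here linear) mapping what we observe to a decision."""
    return max(theta[0] + theta[1] * rainfall + theta[2] * temperature, 0.0)

def season_cost(theta, n_days=90):
    """Score a coefficient vector by simulating the reservoir, not by fitting data."""
    level, cost = 50.0, 0.0
    for _ in range(n_days):
        rainfall, temperature = rng.exponential(5.0), rng.normal(25.0, 5.0)  # assumed weather
        release = min(release_policy(theta, rainfall, temperature), level)
        level = min(level - release + rainfall, 100.0)        # capped reservoir level
        cost += abs(level - 60.0) + 0.1 * release             # penalty for drifting off target
    return cost

# The "training" step is a search over coefficients against the simulated objective.
candidates = [rng.uniform([-5.0, 0.0, -1.0], [5.0, 2.0, 1.0]) for _ in range(200)]
best_theta = min(candidates, key=season_cost)
print("tuned coefficients:", np.round(best_theta, 2))
```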
[00:26:14] The first class is so similar to parametric statistics. The only difference is parametric statistics, you have a training dataset, and the objective function is always some form of metric so that your linear model matches the training dataset.
[00:26:31] Dr Genevieve Hayes: So it's
[00:26:32] something like minimizing the root mean squared error, for example.
[00:26:35] Prof Warren Powell: Yeah, exactly. Now with my PFA, I'm going to use the same set of functions, some sort of analytical function, but I will be given an objective function like minimizing cost, maximizing profit, whatever. I've got some objective, and then usually what I'm doing is, I mean, most of the time, running simulations, but there are real world examples where people do the tuning in the field.
[00:26:59] But with the PFAs, the decision tends to be something fairly simple. So for example, the grid operator, which is trying to schedule energy generation for tomorrow, they've got this big energy program. So they solve a big deterministic energy program and they throw in some tunable parameters,
[00:27:16] I can't do that with a PFA. So the second class, the CFAs, are typically for problems where the decision is a bit more complicated, like a shortest path problem, or maybe that problem with the drugs, where I have to find the best drug.
[00:27:30] So of the four classes, we divide the four classes into two broad classes. The policy search class is the first of the two, and that consists of two subclasses, the PFAs and the CFAs. PFA: a simple rule. CFA: usually going to be some sort of deterministic approximation. It is an optimization problem.
[00:27:51] Now, the optimization can be a sort, the optimization can be a shortest path, the optimization can be a big energy program. Once you embed an optimization, you open the door to anything from a sort to a big energy program. However, we're usually always approximating the real problem in some way because solving exactly truly sequential decision problems is almost always intractable.
[00:28:16] It's just a fundamentally intractable problem. So I've got to approximate in some way. So I'm going to come up with some reasonable approximation. I'm going to make up for that with tunable parameters. So the PFAs have tunable parameters. The CFAs have tunable parameters. But generally these are the simplest methods.
[00:28:33] These are the methods most widely used in practice. Then we get to problems where we just can't get away with these approximations, because notice the PFAs and CFAs do not look or plan into the future. So take the inventory problem. I may have to think about the Christmas rush and constraints and whatnot in the future, but the order up to policy is not planning into the future.
[00:28:58] It's just saying when the inventory goes below some level, order up to another level. Nowhere in it is it thinking about the future. In my problem of finding the best drug, where I've got these 24 drugs and which one is going to work best, I'm not explicitly modeling into the future. But there are problems, like when I'm planning a path to a destination, where you have to plan into the future.
[00:29:20] So those are what I call the look ahead strategies and there's two ways of doing that. The first of them is with Bellman's equation where, I'm in some state of the system. I have to take an action. It takes me to a new state, and I want to know the value of being in that new state. That's Bellman's equation.
[00:29:37] And it sounds so elegant, the academics love it. Most of our books about sequential decision problems start with Bellman's equation, or if you come from Optimal Control, they'll call it Hamilton Jacobi equations.
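One standard way to write the recursion he is describing, using a state S_t, a decision x_t, and a contribution function C (ignoring discounting), is:

```latex
V(S_t) \;=\; \max_{x_t}\Big( C(S_t, x_t) \;+\; \mathbb{E}\big[\, V(S_{t+1}) \mid S_t, x_t \,\big] \Big)
```

The value of being in a state is the best immediate contribution plus the expected value of the state you land in next.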
[00:29:49] Dr Genevieve Hayes: Really interesting, because I realized I've come across both of those in my academic career and I never actually made the connection that they were the same.
[00:29:58] Prof Warren Powell: In fact there's a whole community that calls them HJB equations, for Hamilton, Jacobi, Bellman, because they realize, okay, it's all the same thing, we'll just call them HJB equations.
[00:30:08] So what'll happen is the optimal control people will say, oh, this is great. Oh, by the way, the linear quadratic regulation problem, we can solve it analytically, aren't we heroes? Bellman will come in and say, let's assume we have a problem where we have these discrete states, and let's hope we don't have very many of these discrete states.
[00:30:26] I'm like, gosh, look, I can solve this exactly. I'm a hero. The problem is linear quadratic regulation only arises in a very tiny number of control problems, and problems where you have a small enough number of discrete states almost never happen. One of the rare places it happens is shortest path problems, where you make a decision when you reach a node of the network and you have to make another decision.
[00:30:48] That's a fabulous application of Bellman's equation, and it's one of the places where everybody does Bellman's equation when they do shortest path problems. Now, I spent a good part of my career solving these transportation problems using Bellman's equation. Not exactly. I did it approximately, and that's this wonderful field called approximate dynamic programming.
[00:31:08] It's not for a small problem. It's for a very, very, very large problem of fleets of trucks. But that problem had a lot of structure. And it's sort of for what I call resource allocation problems. So I have nice applications of managing blood supplies or managing money or managing trucks. I have something in production, optimizing locomotives.
[00:31:28] As long as you're managing some nice resource, you can say, I'm going to act on the resource now. I'm going to put the driver in the truck and send them to Montana. What's the value of a truck in Montana? And so coming up with those value function approximations, I got all that worked out. It was one of my great career achievements, and it was fabulous until I started working on some other problem.
[00:31:49] And as soon as I started working on some other problem, I realized, wow, that method doesn't really work. And so, when you want to make a decision now in a way that captures not only the immediate cost but the downstream impact, approximating the value of being in a state sounds like a great approach, but once you sit down and start really doing it, it's not so easy.
[00:32:13] Now, this idea is so simple, people keep rediscovering it. And so Bellman discovered it in the 1950s, then he quickly realized there was this thing called the curse of dimensionality. So if you have a state variable that is not one dimension, but it's a vector, once you hit about three dimensions, the number of possible states starts to explode. So the optimal control people discovered the same idea in the 1970s. In the 1980s, two scientists by the name of Rich Sutton and Andy Barto were working in
[00:32:45] the neurosciences, and they had a mouse-in-the-maze problem, and they started to realize that the mouse has to make a decision to turn left or right. And one way to think of that is to say, well, if he turns left, he'll end up at some location where he might find his way out or might not. And what you need is to have the value of being there.
[00:33:05] So they effectively rediscovered Bellman's equation until somebody, and I've spoken to both of them, and neither of them really pinpointed the point in time, somebody said, oh, there's this field called Markov decision processes. Somewhere they stumbled into this field that Bellman invented, Markov decision processes, and they went, oh my gosh, this is great, we're going to adopt it.
[00:33:28] So they adopted all of Bellman's notation, which sadly is not the best notation, and some of it is just really bad. It's great for mathematicians. Mathematicians just love it. But once you sit down to do computational work, it's a problem. So there's this field of Markov decision processes, and they adopted one notation.
[00:33:48] There's an entirely different field called optimal control, also solving sequential decision problems, and they adopted completely different notation. The difference is Markov decision processes, the foundation was laid by mathematicians, so they didn't really care about practical things. The optimal control, this was really done by engineers.
[00:34:07] They were trying to solve actual control problems, and their notation is way more practical. The controls people introduced something that was fundamentally, incredibly useful, called a transition function.
[00:34:20] Whereas if you're in a state and you make a decision, they called it a control, and then new information comes in, you have some function that tells you the updated state. It's wonderful. It's powerful. Everybody uses it. But the Markov decision process people never discovered this thing called a transition function.
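The controls-style transition function being described here is usually written, in the notation Powell uses in his book, as:

```latex
S_{t+1} \;=\; S^M\!\big(S_t,\, x_t,\, W_{t+1}\big)
```

where S_t is the state, x_t the decision (the control), and W_{t+1} the new information that arrives after the decision is made.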
[00:34:39] They have what's called a transition matrix: if you're in a state S and you take an action A, what is the probability you go to some downstream state? That looks fabulous on paper, but you cannot compute it. So for anybody actually writing code, that notation is just useless. But the computer scientists, they discovered the MDP notation.
[00:35:00] And they said, oh, this is great notation, we're going to write it down one way, but when we program the computer, we're going to do something completely different. So they discovered this thing called reinforcement learning, which was approximating the downstream value, and it was supposed to be this great discovery, and Sutton and Barto wrote a book and it became hugely famous because they started solving some problems like chess or whatnot.
[00:35:21] And everybody's like, Oh, my God, reinforcement learning, reinforcement learning. Now what happened in computer science is exactly what happened in every other field. Everybody starts with this great hammer and they're like, wow, this is great. Look, I can hammer nails. This has happened to every sub community.
[00:35:37] But over time, they get tired of hammering that nail and they want to go and solve a different problem. Every community does this and suddenly the hammer doesn't work anymore. And what I'd love to do is to compare the first edition of Sutton and Barto's book to the second edition. The first edition is a hammer looking for the nail.
[00:35:54] They have this one way of solving the problem. Oh my god, this is great. And the thing that Sutton and Barto did that was so important is that they wrote the book at the level that a high school student could read it. So much of this field is written at a mathematically advanced level. And you open it up and you go, oh my God, this is so complicated.
[00:36:12] Computer scientists are not actually very good mathematicians and they're terrible modelers. And Sutton and Barto did a brilliant job of writing it at the level of a high school student. I'm telling you, you could put that book in the bathroom and every time you go in, open it up, go through a few more pages.
[00:36:29] And after a few trips, you've learned reinforcement learning.
[00:36:33] Dr Genevieve Hayes: I've read the book. It's an excellent book.
[00:36:35] Prof Warren Powell: Yeah. Well, once you get to the second edition, they went from what is 1998 to 2018. So we're talking about 20 years, a lot of time passed and the field matured. And all of a sudden they have all these other methods because they suddenly realized their first method doesn't always work. So guess what?
[00:36:54] If you know about my four classes of policies, I can go into the second edition and point to examples from each of the four classes of policies. What you won't get from the second edition is any hint of four classes of policies, or even a proper way to evaluate them. They do not write down a proper objective function for evaluating a policy.
[00:37:17] It's just something about how the field evolved. They don't write an objective function. If you go to the optimal control people, with Hamilton-Jacobi, oh my god, we're great, look. But that was back in the 1950s. It was in the 1970s where people started to say, okay, wait a minute, that doesn't always work.
[00:37:33] And they started to develop model predictive control. Model predictive control is a fancy word for look ahead. It's specifically deterministic look ahead. So Google Maps, which looks ahead is a form of model predictive control. Now. My community calls those rolling horizon procedures.
[00:37:50] Computer scientists call it receding horizon. The controls people call it model predictive control. I decided to call it, in plain English, a direct look ahead. Because you call it a direct look ahead, and now you know exactly what it is.
[00:38:02] Dr Genevieve Hayes: Yeah. You look ahead. Okay.
[00:38:05] Prof Warren Powell: If you use a value function, that's a way of looking ahead, but it's very complicated, because you have to approximate the value function. Now, with a direct look ahead, you have to make a decision. The direct look ahead is the fourth of the four classes. There's two ways to do a direct lookahead, deterministic or stochastic. So obviously most people do deterministic. So Google Maps is deterministic. You plan into the future using a point estimate of all the travel times.
[00:38:29] Now you know that you're going to update and there's going to be accidents and congestion and things like that, but you start off with a deterministic lookahead. And why would you use a lookahead and not use Bellman? Ah, let's just say computing that value function is way too expensive to be run while you're driving your car and you get an update.
[00:38:47] It's just way too complicated. You've got to solve the lookahead. Now there are going to be problems where deterministic lookahead is perfectly reasonable, especially if you can find some ways to put in some tunable parameters, like we do where I have to know how far in advance to look, Let's imagine, so Google has this information where on every link of the network they have a lot of data on the travel time, and there'll be some sort of a bell shaped curve, because sometimes there's congestion, and the times are really big, and sometimes it's really free flowing. They probably use a point estimate, an average.
[00:39:19] What if they used the 80th or the 90th percentile? So, that's a way of saying, yes, I like that link, but I don't want to look at how well it looks on average. I want to look at how bad it might be. Now that's a very powerful way of handling uncertainty, and it means that the underlying algorithm is no more complicated.
[00:39:38] I'm just using a different point estimate. Instead of using the average, I'm using the 90th percentile. So when I looked at Google Maps, finding that trip back through New York City, I didn't look at how long it might take through New York City, I looked at how bad it might be.
[00:39:54] There's an entire book that develops those ideas, called robust optimization.
[00:39:59] Robust optimization would also do the same thing. It would say, well, how bad might it be? So it's kind of like using a deterministic look ahead, but it's a very specially constructed deterministic look ahead. My shortest path example, that's fairly simple. I mean, Google could easily do that. Now the question is, I've got this tunable parameter.
[00:40:17] Do I do the 90th percentile, or the 70th? And under special conditions, there's ways to tune it. You know, let's say I'm doing the same trip. Let's imagine that I can enter into my nav system how risk averse I want to be. And let's imagine every time I do the trip, I see if I'm late or not, and if I come in late, I'm going to say, okay, Google, next time I do that trip, I want to be more risk averse.
[00:40:41] So I do that tuning. That is easily implementable with a full look ahead.
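A minimal sketch of that idea in Python: the same deterministic lookahead, but each route is costed at a tunable percentile of its historical travel times instead of the average. The two-route network and the travel-time samples are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented historical travel-time samples (minutes) for two ways home.
history = {
    "through NYC": rng.gamma(shape=4.0, scale=10.0, size=500),  # faster on average, long tail
    "around NYC":  rng.normal(loc=45.0, scale=3.0, size=500),   # a bit slower, very reliable
}

def plan_route(percentile):
    """Deterministic lookahead: cost each route with a single point estimate,
    where the percentile used is the tunable risk parameter."""
    costs = {route: float(np.percentile(times, percentile)) for route, times in history.items()}
    return min(costs, key=costs.get), costs

for q in (50, 90):
    route, costs = plan_route(q)
    rounded = {r: round(c) for r, c in costs.items()}
    print(f"costed at the {q}th percentile -> take '{route}'  {rounded}")
```

With these invented numbers the median sends you through the city, while the 90th percentile sends you around it, which is exactly the risk-averse behavior being described.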
[00:40:48] Dr Genevieve Hayes: So let me get this straight. So the four classes of policies, I think policy function approximation, cost function approximation,
[00:40:55] Prof Warren Powell: value function approximation.
[00:40:57] Dr Genevieve Hayes: a value
[00:40:57] Prof Warren Powell: Value function approximation and direct look ahead.
[00:41:01] Dr Genevieve Hayes: right.
[00:41:01] Prof Warren Powell: PFA, CFA, VFA, and DLA. And the
[00:41:04] Dr Genevieve Hayes: Okay.
[00:41:05] Prof Warren Powell: DLA, I like to split between deterministic look ahead like Google Maps and stochastic look ahead where this would be for problems where I'm like, look, I've got to model the uncertainty in the future.
[00:41:17] So you may have heard of decision trees. A decision tree is a look ahead where you model the uncertainty in the future.
[00:41:25] Dr Genevieve Hayes: I get it. I like reinforcement learning, which is a value function approximation. I studied that when I was doing my master's
[00:41:33] Prof Warren Powell: I'm sorry, what year?
[00:41:35] Dr Genevieve Hayes: 2018, I think.
[00:41:37] Prof Warren Powell: Or 2018. Did you use the first edition or the second edition of Sutton and Barto?
[00:41:42] Dr Genevieve Hayes: The second edition was available. There's a free PDF on the internet at that time.
[00:41:48] So I had,
[00:41:49] a hard copy of the first edition and I was looking at the second edition as well.
[00:41:52] Prof Warren Powell: Ah, okay, great. So the thing is, the first edition, you talked about value function approximations. Yes, the first edition did value function approximations. But in the second edition, they did a UCB policy. So my drug example with the, hey, use the point estimate and go up two standard deviations, that's called upper confidence bounding.
[00:42:11] They have a policy gradient method. That is taking a PFA with a tunable parameter and tuning it. That's a policy gradient method. And you have Monte Carlo Tree Search, which is a lookahead, direct lookahead. They have all four classes of policies in that book.
[00:42:26] Dr Genevieve Hayes: I just didn't notice it when I was reading.
[00:42:28] Prof Warren Powell: Yes, I know because they didn't point it out
[00:42:30] because Rich, I'm sorry.
[00:42:32] I love Rich dearly. I've had him to workshops that I did here at Princeton. But he very much, you know, he's got his own research group and his own ideas. And I've been talking about the four classes of policies since, well, let's say 2014, and lecturing on them extensively. The thing is, he still is very much geared toward his own approach. Like, for example, he covers Monte Carlo tree search, but only briefly.
[00:43:00] He doesn't actually explain it mathematically. If you go to chapter 19 of my book, I have all the direct look aheads. I actually have two flavors of Monte Carlo tree search: classical, which I call pessimistic Monte Carlo tree search. And then I have something new that we invented called optimistic Monte Carlo tree search.
[00:43:19] And I give the full description of the whole algorithm. But I also, when I do a stochastic look ahead, so let's say you've decided, oh, I can, okay, I'm going to do a look ahead. Now you can do a deterministic look ahead with tunable parameters. That's what most people do. But let's say that you really feel like you need a stochastic look ahead.
[00:43:39] If you do a stochastic look ahead, that means I have to optimize into the future and I have to make decisions in the future. Well, how should I make decisions? Well, I could use any of the four classes of policies. But if I'm sitting here, say, in a video game doing a stochastic look ahead I'm not going to actually implement those decisions.
[00:43:58] I'm just using it to approximate the future. And I have to do it very, very quickly. So I really would like to simulate making decisions using something fairly simple. So I cover all four classes of policies in my stochastic lookahead.
[00:44:13] Dr Genevieve Hayes: So which of these class of policies do you find to be the most useful in practice?
[00:44:18] Prof Warren Powell: Ah, thank you for that question. So let's take my four classes. Now I want to take the fourth class, the lookahead, and split it into two subclasses, deterministic lookahead and stochastic lookahead. So now I have five types of policies. The first three classes, and then two versions of direct lookaheads.
[00:44:36] Let me divide those into three categories. Category one is the one that everybody uses. PFAs, CFAs, and deterministic direct lookaheads, DLAs. Deterministic lookaheads. Those are the big three. And when you go around and you start realizing that, let's face it not everybody's a statistician, but every human being makes decisions.
[00:44:57] Every time you make a decision, here's a little challenge. Ask yourself, which class of policy did I use for that particular decision? And the vast, vast majority of time, it's going to be PFAs, CFAs, and deterministic look aheads. Now, there are going to be times when you have to think about the future and realize there's something uncertain, and the good may happen, but the bad may happen, and you have to trade those off in that stochastic look ahead.
[00:45:24] Think of a decision tree. So I put that in Category 2. I don't think it's used that often, but it is used. It's clearly used. I mean, decision trees, it's a standard thing to think about. It is harder to do. Now, the third category is the category where I spent 20 years of my life and I have a 500 page book.
[00:45:43] Value function approximation, I put that in the third category. So what you call reinforcement learning, I put in the third category. Let's face it: go into any business, where everybody's making decisions all over the place. Go into Wall Street, go into any medical center, anywhere, and ask if any one of them is using Bellman's equation, and they'll go, never heard of it.
[00:46:09] The journals are full of papers using Bellman's equation. Now, there are uses, there are people using it, I'm using it in my own work, Optimal Dynamics uses it, we, solve millions of these dynamic programs every day. There are people using it in energy. There absolutely are applications of value function approximations.
[00:46:26] But think of decisions like fish, tuna are really good fish. I like to eat tuna, now how many tuna are out there compared to all the other fish in the ocean? So in the universe of all the decisions being made in any subfield, how many of those decisions are being made with Bellman's equation?
[00:46:46] And I'm going to go, not very many. Okay. So while everybody's all excited about reinforcement learning, try to find real world examples of it. Even within reinforcement learning. Go to Sutton and Barto's second edition. Even they're not using it. That's why they have Monte Carlo tree search.
[00:47:04] That's why they have the policy gradient method. That's why they have upper confidence bounding. They don't realize that UCB policies are a form of embedded optimization. Nowhere in Sutton and Barto's book is a decision that's a vector; all the decisions are scalars, because that's what computer scientists work on.
[00:47:23] So, for example, if you're Amazon having to allocate inventories, that is a vector value decision.
[00:47:28] Almost any resource allocation problem is. You know, a shortest path problem is, technically: even though at any one node you're making a discrete decision of turning left or right, if you plan into the future, you have to plan the whole path.
[00:47:41] And that's a vector. You're not going to see vector-valued decisions in Sutton and Barto. Computer scientists just don't work on them. So my book uses X for decision, because that's the notation that math programming uses. And X can be binary, X can be something discrete, and X can be some big vector. It can be anything that you want.
[00:48:03] So let's face it, there aren't any professors out there who know this material. If you go take a course in this, you're going to have a reinforcement learning course, or an optimal control course, or a dynamic programming course. So professors have all been trained in one of the specialty areas.
[00:48:16] So I say, okay, students, teach it to yourself. So I have two designs for weekly seminars. One at a more introductory level that uses the undergrad book, and a more advanced graduate level where I tell you how to go through the big book. It's 1,100 pages. I actually say, here are the sections to read. But the undergrad book, I basically walk you through, just use the various chapters. So I was just at a military workshop, and I said, look, once you get the style of how to write up a sequential decision problem, throw the book out, and now shift to the problems that are familiar to you, like military problems, or if you're in health, health problems.
[00:48:54] This is easiest to learn when you're doing decision problems that are in a domain that you're familiar with. Then what you need to do is walk through a process. I have three questions that I like to ask. For any sequential decision problems, please tell me the answers to three questions, without which I cannot go anywhere. Tell me your metrics, tell me what types of decisions you're making, and what are the types of uncertainties.
[00:49:20] Now, let's say that you're not a computer geek or a math geek. Let's say, for example, you're a military officer, so I was talking to somebody at Naval Postgraduate School, and he teaches undergrads and master students, and those master students are military students. They're going into the military.
[00:49:36] They're not going to write any code, or do any math. I claim that this way of thinking about sequential decision problems helps you think about sequential decision problems. So if you're going to do a computer model, I can't build a computer model without the answers to those three questions.
[00:49:52] If you want to build a computer model, there are five elements to any sequential decision problem: state variables, decision variables, the random stuff, which I call exogenous information, the transition function, and the objective function. Those are the five elements for any sequential decision problem.
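One way to lay those five elements out in code, using a toy inventory problem with invented numbers; this is a generic skeleton for illustration, not the notation or software from the book.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. State variable: what we know right now.
state = {"inventory": 20.0}

# 2. Decision variable: produced by some policy (here an order-up-to PFA).
def policy(state, reorder_point=10.0, order_up_to=30.0):
    if state["inventory"] < reorder_point:
        return order_up_to - state["inventory"]
    return 0.0

# 3. Exogenous information: the random data that arrives after we decide.
def exogenous_information():
    return {"demand": float(rng.poisson(12))}

# 4. Transition function: next state from current state, decision, and new information.
def transition(state, decision, info):
    return {"inventory": max(state["inventory"] + decision - info["demand"], 0.0)}

# 5. Objective function: accumulate the contribution of each period over time.
def contribution(state, decision, info):
    sold = min(state["inventory"] + decision, info["demand"])
    leftover = state["inventory"] + decision - sold
    return 5.0 * sold - 2.0 * decision - 0.1 * leftover

total = 0.0
for t in range(52):
    x = policy(state)                    # decide
    w = exogenous_information()          # then observe
    total += contribution(state, x, w)   # score
    state = transition(state, x, w)      # and move to the next state
print("simulated objective over 52 weeks:", round(total, 1))
```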
[00:50:10] But let's say you're not going to build a model. Let's say you're not going to do any math. You just want to go out and solve real problems. I claim that you will think about your problem more clearly if you can answer those three questions. Now, the first time I tried this out was in a supply chain workshop with business executives.
[00:50:26] None of them did math or computers. And so I said, here's these three questions. Now, business people always understand metrics. Most people usually understand metrics. But sometimes you don't: let's say you have a messy problem, like climate change.
[00:50:40] What's your metric? My daughter works in health. Her first position had to do with dealing with the opioid epidemic, and I said, okay, what are your metrics? So she actually could say, okay, here are the metrics. Here's the hard one for the complicated problems: decisions. What types of decisions are you making?
[00:50:57] I would get blank stares. Now nobody disagrees with the following statement. If you want to perform better in your metrics, you have to make better decisions. Everybody agrees with that. Everybody nods. Isn't it sensible to say, where's your little red book where you list what types of decisions you're making?
[00:51:16] And everybody gives me a blank stare. They haven't the faintest idea what I'm talking about.
[00:51:19] Dr Genevieve Hayes: So what do you mean by a type of decision? Is it, I don't know, what stock to order for a particular company?
[00:51:26] Prof Warren Powell: For example, so I was talking to an inventory expert in supply chain, and he was in love with his problem of forecasting. And I said, what decisions does that support? And he says, okay, how much to order? Now a real inventory problem involves much more than just how much to order. There's when to order, who to order it from, there's the design of the product, there's marketing the product, there's pricing the product, there's backup inventories, there's actually a whole wealth of decisions that come together when you're running a supply chain.
[00:51:57] And yet I've been through all these supply chain books and nobody sort of clearly says, well, here's all the decisions you have to think about. So when I'm ordering inventory for the Christmas season, I have to deal with the fact that I might have something that if I don't sell it for Christmas, I'm not going to sell it.
[00:52:13] And that's fine, but I have the opportunity to do sales or to give it to some group that can get rid of it. I have all sorts of strategies of what to do afterward. You know, I can discount it if I think that I'm not going to get rid of my inventory. There's all types of decisions out there.
[00:52:27] I should at least have them in a little red book or a spreadsheet or something. See, the computer scientists, they like to do things like solve chess or computer Go, you know, computer games. Now, those sound very challenging. Chess is very hard, but let me tell you what's very easy: modeling it. I know what the decisions are, moving the pieces, and I know what the rules are.
[00:52:48] It's really easy. I can write it down on half a sheet of paper. Solving it is hard, but modeling it is easy. But you go out in the real world, and modeling isn't always so easy. Toy problems are easy, but real problems aren't that easy.
[00:53:01] Dr Genevieve Hayes: And what was the third question again?
[00:53:03] Prof Warren Powell: Types of uncertainties.
[00:53:04] Dr Genevieve Hayes: Types of uncertainties.
[00:53:05] Prof Warren Powell: I was working with a chemical company, Air Liquide, and they asked me, how do you come up with the types of uncertainties? And to be quite honest, it's the same way you come up with the types of decisions. Get yourself a whiteboard and a few six-packs of beer.
[00:53:18] And brainstorm. Now, uncertainties for a complex problem, that is really tough. Imagine a supply chain and start brainstorming all the things that can go wrong. I mean, there are whole books, endless books, on supply chain risk and resilience, and everybody loves that word, risk. And all of them have something to do with uncertainty, and you get nice discussions of types of uncertainties.
[00:53:41] And you've got all these types of uncertainties up until something happens that you never thought would happen. Like the ship getting stuck in the Suez Canal, the world is a complicated place.
[00:53:52] Dr Genevieve Hayes: So, types of uncertainties, it's everything from the weather onwards, basically.
[00:53:56] Prof Warren Powell: Oh, yeah, but it gets really messy. Like the first time the SARS virus happened. I mean, who would have thunk? And then you have how people react to these, like the COVID virus. Okay, COVID hit. Yeah. But then predicting what China would do, who guessed at that? I mean, these things get complicated.
[00:54:14] And the thing is you have to think about them because you want to say, well, like when the tsunami hit the Fukushima
[00:54:20] reactor and just wiped out this town, turns out that town was a place where a lot of key car components were made, and it wiped out some supply chains because of all those components that were made in that one place. Now, there are companies that do nothing but study supply chain risk. You can hire these companies to come in and help you with this. It's such a big deal, and it's complicated, and it's more than just having a list on a whiteboard because you want to have some sense of how often it might happen.
[00:54:46] Dr Genevieve Hayes: So, for listeners who are interested in learning more about sequential decision problems, where would be the best place to begin?
[00:54:54] Prof Warren Powell: Okay, I have a webpage for you: tinyurl.com/sdalinks.
[00:55:04] It's a resources webpage. It's a webpage of links: it has books, it has videos, it has informative webpages, and it has a link to my webpage on how to teach it to yourself. At the end of all my talks, I give that one URL, tinyurl.com/sdalinks. And that's where I put my greatest hits of links. And that's a good starting point.
[00:55:29] I do have a video, a YouTube video, but it's on the SDA links webpage. Okay, so if you just write down that one, everything's there. There's a YouTube video, about 45 minutes, that is my standard introduction. It really is the best place to start. The second thing is the undergrad book that you can download for free. Okay, and the beautiful thing about it is that at 300 pages, it's easy to skim.
[00:55:53] Because once you read chapter one, and then you start looking at chapter two, and then three, you start realizing that each of the application chapters is just following the same style. You don't actually have to sit and read the whole book. You just get a sense of the style, because you're not going to learn the material until you apply it to one of your own problems, and that's what really makes it come to life.
[00:56:15] So let me tell you what started to happen to me when I started to realize that whatever problem you have, you're going to make a decision with some method, and it'll always be in one of my four classes. I stopped getting so worried about how to solve the problem and more worried about just identifying decisions.
[00:56:33] So Amazon invited a bunch of academics to a workshop and they gave us a tour of one of the fulfillment centers. And I found myself walking around that fulfillment center looking for decisions. Okay, and it's interesting because what you do is you start being more aware of oh, that's a decision.
[00:56:52] Because a lot of times, as we go through life, we don't really think about how many decisions we're making, because, you know what, we've fallen into some pattern and we're like, okay, well, this is what I usually do.
[00:57:01] That's how we get around. Like your statistics:
[00:57:02] you had to make a decision to get that data.
[00:57:06] You'd be amazed at how many things you do in day to day life where you had to make a decision. And yes, you could have made a different decision. And I'm not saying you need to do something complicated with Bellman's equation. I'm claiming that you already did something.
[00:57:19] And you used some method, probably a PFA (a policy function approximation), a CFA (a cost function approximation), or a deterministic DLA (a direct lookahead).
[00:57:25] Dr Genevieve Hayes: I would say it's almost certainly going to be a PFA.
[00:57:27] Prof Warren Powell: Well, it depends because I claim that those three policies, it's sort of obvious which one you use. Google Maps is doing a deterministic look ahead for you. Anytime you're going to buy, say, a new TV, you're going to go onto the internet, look around, and you're going to trade off the lowest price with things like, is it returnable?
[00:57:47] Do they have service? Things like that. That's going to be a CFA. Okay, now what you might overlook is the fact that once you try buying that TV from Best Buy, you learn what Best Buy is like and that experience can teach you about it in the future. We often overlook the fact that it's a sequential decision and what carries over is what we learn.
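As a toy illustration of that kind of CFA, and nothing more, here is a sketch in Python. The vendors, attributes, and the two weights are invented; the structure, an embedded cost with tunable penalty parameters, is the point.

```python
# A toy cost function approximation (CFA): pick a vendor by minimizing price
# plus penalties for the features you care about. Everything here is hypothetical.
vendors = [
    {"name": "Store A", "price": 499, "returnable": True,  "has_service": True},
    {"name": "Store B", "price": 449, "returnable": False, "has_service": False},
    {"name": "Store C", "price": 479, "returnable": True,  "has_service": False},
]

def cfa_score(v, theta_return=40.0, theta_service=25.0):
    """Lower is better: price plus penalties for missing features."""
    penalty = 0.0
    if not v["returnable"]:
        penalty += theta_return
    if not v["has_service"]:
        penalty += theta_service
    return v["price"] + penalty

best = min(vendors, key=cfa_score)
print(best["name"])  # the CFA's choice under these particular weights
```

Change the weights and the choice changes, which is exactly what makes them tunable parameters.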
[00:58:10] Dr Genevieve Hayes: I'm looking at this from the point of view of someone who fits models to training data. The standard process that data scientists are taught is fit your model to the data that you have, put the model into practice, but then the world changes, get new data, refit the model.
[00:58:28] And that sounds like what you're describing as a sequential decision process.
[00:58:33] Prof Warren Powell: Okay, so let me first point out that that's under the assumption that whatever decisions you're making with your model are not going to affect what you observe in the future. Often that's true, but not always, so you have to be aware of it. I may have an energy battery, and I'm going to be buying and selling energy to and from the grid in large quantities. Just like anybody on Wall Street running a monstrous mutual fund, how much you buy and sell will affect market prices. So you may say, I'm going to fit a linear model to get a relationship between how much I buy or sell and what price ends up getting charged. However, how much you buy and sell affects what you observe, and that's what allows me to learn about it in the future. So sometimes we're doing things automatically where, with a little bit more thought, we'd realize, okay, I could actually do a better job of this.
[00:59:27] See, sometimes, as in your example, the only thing that carries forward is what you learn, and possibly, not always, but possibly, what you learn will affect the decisions you're making. Hopefully that's always the case. But the decisions you make may or may not affect what you observe in the future. If they do, then you have an active learning problem, and that is truly a fully sequential decision problem.
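A rough sketch of that distinction, purely illustrative: the quantity we choose to trade shifts the price we observe, and each observation updates a fitted linear price-impact model. The "true" coefficients, the noise level, and the randomized probing rule below are all made up.

```python
import numpy as np

rng = np.random.default_rng(1)
true_base, true_impact = 50.0, 0.02              # unknown to the decision maker

quantities, prices = [], []
beta = np.array([45.0, 0.0])                     # initial belief: [base price, impact]

for t in range(20):
    # Decide how much to buy; here a simple randomized probe so that the
    # impact coefficient can be learned at all.
    quantity = 100.0 + 50.0 * rng.random()

    # The decision affects what we observe: the price rises with the quantity.
    observed_price = true_base + true_impact * quantity + rng.normal(0.0, 0.5)

    quantities.append(quantity)
    prices.append(observed_price)

    # Refit the linear price-impact model with everything observed so far.
    if len(quantities) >= 2:
        X = np.column_stack([np.ones(len(quantities)), quantities])
        beta, *_ = np.linalg.lstsq(X, np.array(prices), rcond=None)

print(f"Estimated impact coefficient: {beta[1]:.4f} (true value {true_impact})")
```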
[00:59:53] So, I
[00:59:54] Dr Genevieve Hayes: Yes. Yeah, I understand what you're saying. Okay.
[00:59:58] Prof Warren Powell: have a slide where I have the shortest path problem on one side, and over here I have a choice of five drugs, and which drug is best, and I don't know which one's best. That's a learning problem.
[01:00:09] In my shortest path problem, the state of my system is what node of the network I'm at. Am I at node two? If I'm at node two, should I turn left or right?
[01:00:16] That's the state of me. In the learning problem, my state is my beliefs, my statistical model. So whether it's a frequentist distribution or something else, it's whatever your sufficient statistics are. Okay. Both of them can in theory be solved with Bellman's equation. Both of them have a well defined state variable.
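In symbols, using a standard textbook form rather than anything written out in the interview: for the shortest path problem the state is just the node you are at, and Bellman's equation recurses over its neighbours; for the drug-selection problem the state is the belief, for example an estimated mean and precision for each drug's performance.

```latex
% Shortest path: state = current node n, c_{n,n'} = cost of the arc from n to n'
V(n) \;=\; \min_{n' \in \mathcal{N}(n)} \bigl[\, c_{n,n'} + V(n') \,\bigr]

% Learning problem: state = the belief (the sufficient statistics),
% e.g. an estimated mean and precision for each of the five drugs x
S_t \;=\; \bigl(\bar{\mu}_t^{x}, \beta_t^{x}\bigr)_{x = 1,\dots,5}
```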
[01:00:35] You can set up Bellman's equation. However, okay, so for the shortest path problem, we really do use Bellman's equation. For that learning problem, almost never. Once you become aware of these four classes of policy, you stop stressing about how to solve it. I have endless friends in operations research. They hear the word sequential,
[01:00:53] and they're like, oh, Markov decision problem. The first equation they write out is Bellman's equation.
[01:00:57] They just start with Bellman's equation. There are a number of papers where they never actually write out the objective function. Have you ever done linear programming?
[01:01:05] Dr Genevieve Hayes: Yeah, I did.
[01:01:06] Prof Warren Powell: Okay. And so you know, every time you do a linear program, you write out an objective function with constraints.
[01:01:11] Dr Genevieve Hayes: Yes.
[01:01:11] Prof Warren Powell: Okay, fine, yes, that's the right way to do it. But sequential decision problems, do you know how many people will write out Bellman's equation? That's just like writing out complementary slackness conditions for an LP. I mean, why would you do that? There is an objective function and generally people don't write it out.
[01:01:28] It involves maximizing or minimizing, over policies, the expectation of something, okay? And people generally do not write out that objective function. You want to optimize over policies. Because the Bellman people go, well, you just solve Bellman's equation, that's the optimal policy, so we're done. Yeah. The problem is, I can't solve Bellman's equation.
[01:01:46] And once you go down the road of solving it approximately, it's no longer optimal. And in fact, not only is it not optimal, it's probably not even very good. And that's why everybody uses these other classes of policies.
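For the record, the objective he is describing, written in the form Powell uses in his books (the notation here is reconstructed, not quoted from the interview): an optimization over policies of an expected sum of contributions, with the state stepping forward through the transition function.

```latex
\max_{\pi} \; \mathbb{E} \left\{ \sum_{t=0}^{T} C\bigl(S_t, X^{\pi}(S_t)\bigr) \;\Big|\; S_0 \right\},
\qquad
S_{t+1} = S^{M}\bigl(S_t, X^{\pi}(S_t), W_{t+1}\bigr)
```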
[01:01:59] Dr Genevieve Hayes: Fair enough.
[01:02:00] Prof Warren Powell: So you can understand the shift in thinking. Now, by the way, if you were interviewing me 15 years ago, I would be Mr.
[01:02:06] Approximate Dynamic Programming. Oh my God, ADP, look at how it solves this problem, and I'm wonderful with my great new hammer. Okay, that would have been me 15 years ago. So now this is the much older and wiser me that says, oh my God, these sequential decision problems, there's a million of them, and they're just not all the same. But every way of solving them boils down to these four classes, and PFAs, CFAs, and DLAs, those are the big three, and pretty much every problem will be sort of screaming which one of those makes sense.
[01:02:39] I had a wonderful postdoc Stefan Meisel, who now teaches in Germany, and we made up an energy problem, you know, an energy storage problem. We could either buy energy from a wind farm or from the grid, and I have to meet the, you know, fairly predictable demands of a community.
[01:02:54] And then I have a giant Tesla mega-battery. So we sat and made up five different variations of that problem. And then we created five policies. The first four were from each of the four classes. The fifth one was a hybrid, because you can mix and match. And we made up data for five different versions, and we were able to show that each of the five policies could be made to work best on one of the problems. Now, that's actually unusual, because there are some problems where some of the policies are just silly. Sorry, you are not going to be able to solve a shortest path problem with a PFA.
[01:03:32] You're just not going to be able to do it. That's just silly. You're never going to find some kind of shortest path problem where a PFA will work best. You're always going to be comparing paths, which means you have an embedded optimization. And by the way, with shortest paths, you've got to do a look ahead.
[01:03:46] So now it's just down to, do you want to do a deterministic look ahead, and there's other things going on there. So some problems will scream at you, this is the obvious way to do it. Other problems you might have two or three choices. And then it boils down to not just which one works best, but you also think about how complicated is it?
[01:04:08] What's the CPU time? How much data is required? And what people often overlook when they veer toward the simpler ones is that they don't really appreciate that the price of simplicity is tunable parameters. And tuning is hard.
[01:04:23] So you say, Oh, well, PFA, of course I'm gonna do PFA. Guess what? Take a deep breath, take a look at the tunable parameters and tell me how you're going to tune them.
[01:04:31] What most humans do is we pick something that seems reasonable. We experience it and then we make adjustments. So in our day to day decisions, that's fine. But let's say you're with a company where there's serious money behind it. Eh, pay somebody to build a simulator and get that tuning done right.
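To make "build a simulator and get the tuning done right" concrete, here is a rough sketch, with made-up prices, capacities, and candidate thresholds, of tuning a two-parameter buy-low/sell-high storage policy by grid search over simulated runs.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

def simulate(theta_buy, theta_sell, horizon=1000, capacity=100.0):
    """Profit of a buy-below/sell-above policy over one simulated price path."""
    energy, profit = 0.0, 0.0
    for _ in range(horizon):
        price = 30.0 + 10.0 * rng.normal()            # toy exogenous price process
        if price < theta_buy and energy < capacity:    # policy rule: buy low
            energy += 10.0
            profit -= 10.0 * price
        elif price > theta_sell and energy >= 10.0:    # policy rule: sell high
            energy -= 10.0
            profit += 10.0 * price
    return profit

# Grid search over the two tunable parameters, averaging several simulated runs each.
candidates = [(b, s) for b, s in itertools.product(range(15, 30, 3), range(33, 48, 3)) if b < s]
best = max(
    ((b, s, np.mean([simulate(b, s) for _ in range(20)])) for b, s in candidates),
    key=lambda item: item[2],
)
print(f"theta_buy={best[0]}, theta_sell={best[1]}, estimated profit={best[2]:,.0f}")
```

The day-to-day version of this is what he describes: pick something reasonable, watch it, adjust. The simulator just lets you do that systematically before real money is at stake.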
[01:04:50] Dr Genevieve Hayes: What final advice would you give to data scientists looking to create business value from data?
[01:04:55] Prof Warren Powell: So when I hear the words data scientist, I think of somebody working with a big pile of data. This is somebody where you've got problems where you want to do better,
[01:05:04] Dr Genevieve Hayes: Yes.
[01:05:05] Prof Warren Powell: You want to do better. And the way to do better is with better decisions.
[01:05:08] You've got to start from that point. I think it's far more effective to come at this from the perspective of an application area or problem domain. Doing this as a math problem, I just find not very useful. You really need to start with some application area. So let's say that you have your application area, whether it's health or energy or finance or whatever.
[01:05:30] Start with your application area. And then I like my three questions. What are your metrics? Because if you want to make a good decision, you have to have a metric. There are problems out there, like who to marry. I mean, okay, let's face it, we're not going to use analytics for who to marry.
[01:05:46] Somebody's coming into town. What's the best restaurant? You know, I'll be honest, I don't think I would use analytics for that. There are going to be problems where analytics isn't that useful, but let's say you've got one of those problems where you're making the decisions over and over again.
[01:05:59] You've got well defined metrics. Fine. Cool. Write down those metrics. I'm working with a startup company right now. I'm doing this as a pro bono activity with Columbia University, and they're doing something where they're selling grass-fed meat,
[01:06:12] and I said to them, hey, list your metrics, from most important to least important, and list your types of decisions. Now, I just gave them a matrix where I said, okay, I'm going to make the metrics the columns and the decisions the rows, and I'm having them fill in the matrix with whether this decision has a high impact on this metric, medium, low, or none.
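The shape of that exercise, with entirely invented metrics, decisions, and ratings for a hypothetical version of that business, might look something like this in pandas:

```python
import pandas as pd

# Hypothetical decision-by-metric impact matrix: rows are decisions,
# columns are metrics, entries are the judged impact of each decision on each metric.
impact = pd.DataFrame(
    {
        "revenue":        ["high",   "high", "medium", "low"],
        "waste":          ["medium", "low",  "high",   "none"],
        "customer churn": ["low",    "high", "medium", "medium"],
    },
    index=["pricing", "which cuts to stock", "order quantities", "delivery schedule"],
)
print(impact)
```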
[01:06:34] So I'm going to have an idea of which decisions affect which metrics. Now, depending on how much time you have and your resources, obviously it would be nice to do a computer model. Not everybody has time to build a computer model. See, when you're a data scientist, you assume the data is on the computer.
[01:06:49] You start on the computer. Decisions are so ubiquitous. You can easily come up with problems where you're initially not on the computer. You just have some problem where you want to do better. Now, I'd like to be able to help people think better who are never going to put it on the computer. I mean, just having that analytical thinking, I think, helps you think about a problem.
[01:07:11] Helps you understand it, work it through, even if you never put it to numbers. Simply saying, what are my metrics and what are my decisions, is a good starting point. And then, of course, what are the uncertainties is something that you then have to think about next. Now, just having that in mind, I think you're already in a new space.
[01:07:27] Now, if it's a problem that is important enough, where there's real benefits to justify putting it on the computer, see, you're a data scientist, you're not even going to touch those problems where you're not on the computer. I can start with those problems that aren't yet on the computer, and then get to a point where I'm like, does it actually make sense?
[01:07:46] Well, let's say it does. Fine. And maybe I have a student doing a research project, or maybe I'm in a company where I'm going to hire a consulting firm who's going to come do it, or maybe there are commercial packages out there that I can buy. One thing that I have found, because this way of thinking is so new, is that it's somewhat stunning how many professionals don't understand this, and they actually do end up making mistakes, important mistakes, mistakes that actually matter.
[01:08:13] Okay. I had a student, one of my very, very best students, a PhD student who spent eight years doing operations planning at Kimberly-Clark in Brazil. She's now at my company, Optimal Dynamics. And she would tell these war stories of what happened doing operations planning. And at one point she says, you know what?
[01:08:31] I could really use help. This is big money. So she spent a few months working with a firm, she says, I'll call it McKinsey. I'll bring them in to help me out with how much inventory to order, to deal with the risk that the ship might arrive late, that type of thing. And McKinsey came in, and their only training was textbook material on inventory planning.
[01:08:50] There isn't a textbook in existence. I have a library, I have a bookshelf with about 15 books, from the very most theoretical down to books that have at least some analytics, minimal, but some analytics. Not one of those books will show you properly how to tune the parameters of a simple inventory policy. Not one. Amazing. And this is supply chains. This inventory problem has been around for 70 years. And what happened 70 years ago is Scarf came up with this result. This is a very basic inventory problem. There's a structure for an optimal policy. It's called an (s, S) order-up-to policy: when the inventory goes below some level, order up to a target. And that theoretical result has dominated the study of inventory problems for the
[01:09:36] previous 70 years, and everybody wants that optimal policy. So they take that (s, S) policy, which is fundamentally sound, and when they put it into a more complicated setting, they have to put in parameters, and that's fine. That's actually a perfectly reasonable thing to do. Fortunately, people in industry are not stupid.
[01:09:54] But then they're not told how to tune it. They're not told how to tune it. And that's what's amazing: this straightforward policy actually has more than two tunable parameters, there are actually three or four. And they're not told how to tune it, and none of the books tell them how to tune it.
[01:10:09] And the packages don't do it for them. It's just stunning.
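For concreteness, the decision rule at the heart of that policy is tiny; a toy version with hypothetical thresholds is below. The catch he is describing is that nothing in the rule itself tells you how to choose s and S (or the extra parameters that show up in richer settings); that choice is the tuning problem, and a simulator like the one sketched earlier is one way to attack it.

```python
def order_up_to(inventory, s=20, S=60):
    """Classic (s, S) rule: once inventory drops below s, order back up to S."""
    return S - inventory if inventory < s else 0

# s and S (plus whatever review periods or safety stocks a real setting adds)
# are tunable parameters; this code does not tell you how to pick them.
print(order_up_to(12))   # below s, so order 48
print(order_up_to(35))   # at or above s, so order 0
```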
[01:10:12] Dr Genevieve Hayes: For listeners who want to learn more about you and sequential decision making, what can they do?
[01:10:19] Prof Warren Powell: Well, the starting point, of course, is that tinyurl.com/sdalinks. Now, number two, I am posting on LinkedIn every week. I will not run out of material because it's so rich. There are topics that I haven't even gotten to in my own book. You know, risk. I mentioned risk very, very briefly.
[01:10:41] I have every possible book on supply chain risk, and not one of them defines risk, and it's just stunning. I've got another thing that I'm about to get to. It's a mixture of some analytics. I mean, this is LinkedIn here. Okay, this is LinkedIn. So when I say analytics, it's at the level of a LinkedIn post.
[01:11:00] I've got about 400 pages so far of LinkedIn posts because I always write them down and then hopefully eventually I'll be able to do a second edition of my book. I have a new book that's about to come out, but you can download the book right now from the SDA Links webpage. I realized over the summer that all these schools that are teaching linear programming are teaching it wrong. They're just teaching stupid stuff. So I've written a short book. It's about a hundred pages and it's written for the professor. But it's also perfectly good for anybody who already has a course in optimization.
[01:11:34] It simply says, here's how to teach an introductory course in optimization. Now, I originally got going thinking about the sophomore-level course that we teach here at Princeton, but it could also be, say, a master's course, especially for a more applied audience. But it's very, very much aimed at an introductory audience.
[01:11:51] And so what happens all around the world is that there's a lot of departments in operations research, industrial engineering, that sort of thing. Well, they'll have an introductory optimization course where they teach linear programming. Okay. When you taught linear programming, was it basically a linear programming course?
[01:12:07] Dr Genevieve Hayes: It was called managerial decision analysis and linear programming was the majority of
[01:12:13] Prof Warren Powell: Oh, okay, and you probably also did Decision Trees.
[01:12:16] Dr Genevieve Hayes: No, decision trees were in a different course.
[01:12:18] Prof Warren Powell: Oh, okay, fine. So it's basically a linear programming course. So let me ask you, can you name off the top of your head an actual real world example, and it can be something right from the course, that would be perfectly fine, of a linear program?
[01:12:30] Dr Genevieve Hayes: Honestly, the day I graduated from that course, I think was the last time I actually fit a linear program. Okay.
[01:12:37] Prof Warren Powell: Okay, first of all, one point I make is that almost nobody uses linear programming. Now, linear programming is very useful. No question, linear programming is useful. But it's like that fish in the sea argument. Yes, linear programming is useful for one very particular class of fish, like a tuna, and otherwise not.
[01:12:55] You go into areas where endless people are making decisions without any training in linear programming. Absolutely none of them have heard of it. None of them are doing it. However, here's the thing. I claim that most students are just like you. They take their course in linear programming, and almost none of them ever use it. But decision making is useful. So, go to tinyurl.com/teachingopt and download it; I have the document both as a PDF and as a Microsoft Word document.
[01:13:24] It's very easy to skim. If you have some background in optimization, it will teach you how to think differently about what you've learned. I am going to try to change how the world teaches optimization. I claim we've been teaching all of the examples that Dantzig was using when he invented linear programming.
[01:13:46] All of them were from the military and they were all sequential decision problems. They were all stochastic, dynamic, and sequential. Now, he went on to also invent stochastic programming, which, I'm sorry, I have used almost every possible method for making decisions, except for one. Stochastic programming.
[01:14:04] I have found it to be useless. I just don't like it. But I have coded everything else. My lab has tried everything. I have yet to find a problem where I'm like, you know what, we really should use stochastic programming for this one.
[01:14:16] Dr Genevieve Hayes: On that note, thank you for joining me today.
[01:14:19] Prof Warren Powell: I'm going to enjoy this. This may be a podcast I actually listen to because I'd love to see what actually makes it off the cutting room floor.
[01:14:27] Dr Genevieve Hayes: and for those in the audience, thank you for listening. I'm Dr. Genevieve Hayes and this has been Value Driven Data Science brought to you by Genevieve Hayes Consulting.