Motivating the metric of regret in sequential decision making.
This is the second live blog of Lecture 7 of my graduate seminar “Feedback, Learning, and Adaptation.” A table of contents is here.
When talking about sequential decision making and optimal control, I can’t avoid discussing the mathematical concept of regret. Regret is the preferred theoretical metric for evaluating bandit algorithms, and bandit algorithms are a core method for online decision-making.
Invariably, every time I try to explain “regret,” I get the question, “Wait, why do we care about that?” So to have an answer that I can point to in the future, I’m going to write a few blog posts.
We can use the setup from the last blog. We have a two-player game with repeated interactions. In every round t:
Information x_t is revealed to both players.
Player One takes action u_t.
Player Two takes action d_t.
A score r_t is assigned based on the triple (x_t, u_t, d_t).
Player One is the “decision maker,” and their action has to be computable from a few lines of code. Their goal is to accumulate as high a score as possible, summed across all rounds. Player Two wants the sum of all of the r_t to be as low as possible.
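The interaction protocol can be sketched as a short loop. The four callables below are hypothetical placeholders standing in for the objects described above, not anything from the lecture:

```python
def play_game(rounds, reveal, player_one, player_two, score):
    """Run the repeated game for a fixed number of rounds.
    All four callables are made-up placeholders."""
    total = 0
    history = []  # everything both players have observed so far
    for t in range(rounds):
        x_t = reveal(t)                  # information revealed to both players
        u_t = player_one(x_t, history)   # Player One's action
        d_t = player_two(x_t, history)   # Player Two's action
        r_t = score(x_t, u_t, d_t)       # score assigned to the triple
        history.append((x_t, u_t, d_t, r_t))
        total += r_t
    return total  # Player One wants this large; Player Two wants it small
```

Passing the history to both strategies is what makes the game adaptive: each player can react to everything that has happened so far.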
You could think of designing the optimal policy as an optimization problem. Given a description of what Player Two can do, you can design a strategy for Player One. If Player Two is totally random, you could perhaps maximize the expected score. If Player Two is deterministic, you could maximize the score assuming Player Two plays the best possible strategy against you. Regret provides a flexible framework for going beyond both of these formulations.
To define regret, we imagine a counterfactual world in which Player One knows something about Player Two’s strategy in advance. The regret of some strategy is the difference between the score of a policy built on knowing this secret of Player Two and the score of the strategy that has to learn as it goes. It is called regret because it estimates how much the score could be improved with the benefit of hindsight.
Sometimes this regret is really large. Consider the following example. In each round, Player Two thinks of a color, Red or Blue, and Player One has to guess which color Player Two is thinking of. Player One gets a 1 if they guess correctly and a 0 otherwise. Player Two agrees to choose their entire sequence in advance, but reveals only one entry of the sequence to Player One in each round. In the counterfactual world, Player One would know the entire sequence and would receive a perfect score. In reality, Player One can’t do better than guessing, so they would be hard-pressed to get more than half of the colors correct.
This is where regret gets confusing. We ask: what if Player One in the counterfactual world has the benefit of hindsight but is constrained in their strategies? In this color guessing game, what if the counterfactual Player One is forced to play a single color for the entire game? They see the entire sequence but can only commit to Red or to Blue. In this case, if Player Two chooses an equal number of Reds and Blues, the omniscient yet restricted Player One can only get half of the answers correct. A real-world strategy of random guessing will fare just as well as this counterfactual strategy with the benefit of hindsight.
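A quick simulation makes the comparison concrete. The sequence length and random seed here are arbitrary choices of mine:

```python
import random

def best_fixed_color_score(sequence):
    """Hindsight score of the restricted comparator:
    the better of 'always guess R' and 'always guess B'."""
    return max(sequence.count("R"), sequence.count("B"))

def random_guessing_score(sequence, rng):
    """Score of guessing uniformly at random, with no hindsight at all."""
    return sum(rng.choice("RB") == color for color in sequence)

rng = random.Random(0)                 # fixed seed, an arbitrary choice
sequence = ["R", "B"] * 50             # a balanced 100-round sequence
hindsight = best_fixed_color_score(sequence)      # exactly 50
guessing = random_guessing_score(sequence, rng)   # about 50 on average
```

With a balanced sequence, the restricted hindsight policy scores exactly half, so random guessing has expected regret zero against it, even though the unrestricted hindsight policy would score a perfect 100.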
No matter how many times I explain it, I find this setup confusing. Let me write it again. The regret model requires two things: a secret of Player Two and a restricted strategy set for Player One. In the real world, Player One has a flexible strategy set but is missing information. In the counterfactual world, Player One has a restricted strategy set but extra knowledge. Regret is the difference in scores achieved in these two worlds.
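In symbols (my notation, not the lecture’s): if Π is the restricted strategy set of the counterfactual world and u_1, …, u_T are the real-world strategy’s actions, then

```latex
\mathrm{Regret}_T \;=\; \max_{\pi \in \Pi}\ \sum_{t=1}^{T} r\bigl(x_t,\ \pi(x_t),\ d_t\bigr)
\;-\; \sum_{t=1}^{T} r\bigl(x_t,\ u_t,\ d_t\bigr).
```

The secret enters through the max: the comparator gets to pick the best π in Π after seeing everything Player Two did.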
You might ask why this particular example of color guessing is interesting. I’m not sure it is, but it’s the one we’ll use next week when discussing forecasting. When someone tells you that they have calibrated predictions, they are doing this sort of sleight of hand and comparing against something that you probably don’t actually care about.
But let’s spend some time discussing examples where regret is reasonable. I’ll start with the canonical example: the stochastic multiarmed bandit. If Player Two is random and stationary, then the best strategy in hindsight makes a lot more sense. Our game of colors then becomes the multiarmed bandit problem, the most annoyingly named subject in decision making. In the classic version of this problem, you have two slot machines and want to find the one with the higher payout. Each round, you are allowed to choose one of the machines to play. We model the payout from each machine as an independent draw from a fixed probability distribution. These distributions have different means, and your goal is to devise a strategy that achieves the highest expected payout.
What would the best policy do? If you knew the mean payouts in advance and your goal were to maximize the expected payout, no strategy could beat playing only the machine with the highest mean. Thus, we can think of the secret held by Player Two as the mean payouts of the machines.
If you didn’t know this secret, what would you do? You’d probably spend some time with each machine, look at which one is giving you higher returns, and then pick that one forever. This seems like a reasonable strategy.1 But note that it necessarily has nonzero regret: you have to try both machines to figure out which one is best.
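This “sample both, then commit” idea is usually called explore-then-commit. Here is a minimal sketch under an assumed payout model (Bernoulli machines with made-up means; none of the numbers come from the lecture):

```python
import random

def explore_then_commit(means, rounds, explore_pulls, rng):
    """Pull each machine `explore_pulls` times, then commit to the one
    with the higher empirical payout. Payouts are Bernoulli(means[i]),
    an assumed model for illustration."""
    totals = [0.0, 0.0]
    payout = 0.0
    for arm in (0, 1):
        for _ in range(explore_pulls):
            r = float(rng.random() < means[arm])  # Bernoulli draw
            totals[arm] += r
            payout += r
    best = 0 if totals[0] >= totals[1] else 1     # commit to the empirical winner
    for _ in range(rounds - 2 * explore_pulls):
        payout += float(rng.random() < means[best])
    return payout

rng = random.Random(0)        # fixed seed, arbitrary
means = [0.4, 0.6]            # Player Two's "secret": the mean payouts (made up)
rounds = 10_000
payout = explore_then_commit(means, rounds, explore_pulls=100, rng=rng)
regret = max(means) * rounds - payout   # best machine in hindsight vs. our strategy
```

Exploration alone forces roughly explore_pulls × (0.6 − 0.4) = 20 points of regret no matter how lucky you get, while always playing the worse machine would lose 0.2 points per pull, linear in the number of rounds.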
Any strategy you devise for the real world has a particular expected regret, which is the difference between the expected payout of always playing the best machine and the expected payout of your strategy. In our multiarmed bandit, the worst regret is accrued by always playing the suboptimal machine, so regret can grow linearly with the number of pulls. Bandit algorithms seek strategies whose regret grows sublinearly.
Outside the casino, variants of the stochastic multiarmed bandit are reasonable models for adaptive experimentation. Suppose you want to select whichever of treatments A and B maximizes the average benefit to some cohort of subjects. If you can randomly sample individuals from the cohort, there will be regret associated with the number of subjects assigned to the suboptimal treatment during the randomized experiment, and there will be regret associated with the chance that your experiment selects the suboptimal treatment in the end. You would like to minimize application of the wrong treatment but also be fairly sure you are finding the right one. You can compare this to the counterfactual policy that assigned everyone the optimal treatment in advance.
Tomorrow I’ll talk through three other examples where regret feels like the right concept to me. In hindsight, it’s not always worth the headaches and confusion associated with regret minimization, but there are enough positive examples to make it a concept worth understanding.
1. This is more or less the optimal strategy.



