In the last post, I described Leonard Savage’s argument that we could score predictions based on their utility. If a forecast will be used to compute expected returns, there is no “incentive” to report probabilities that the forecaster didn’t believe in.
I worked out the abstract case for this last week, but today let me make it concrete, as it will finally let me make “rigorous” a puzzling quote by Frank Ramsey:
“The old-established way of measuring a person's belief is to propose a bet, and see what are the lowest odds which he will accept.”
Since I’m going to be “rigorous,” this post will have too much math. But it will also feature degenerate gambling. So hopefully something for everyone.
Let’s suppose we have a single event we’re betting on. There are four possible outcomes: if you bet a dollar that the event happens and the event happens, you receive B dollars. If you bet a dollar that the event won’t happen and the event doesn’t happen, you receive 1/B dollars. If you bet incorrectly, you lose your dollar. Up to the bookie’s fees, most sports betting works in this fashion: the product of the fractional payouts for betting on a team winning and for betting on it losing is always close to one.
Suppose you have a model predicting that the event happens with probability Q. How much of your pot should you bet? You could just maximize the expected value of your return. Suppose you wager the fraction A of your total money (so A is between 0 and 1). Then, if you bet on the event happening, the expected return on your bet is
Q A B - (1-Q) A
That is, if the event happens (with probability Q), you win AB dollars. If the event doesn’t happen (probability 1-Q), you lose A dollars. Using the same reasoning, the expected return if you bet against the event is
-Q A + (1-Q) A/B.
Under your probabilistic model of the future, only one of these bets can give a positive expected value, and that’s the one you’ll choose. Specifically, you should bet on the event if the odds your model places on the event happening exceed 1/B. In math, if

Q/(1-Q) ≥ 1/B

you should bet on the event, and you should bet against it otherwise.
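As a quick sanity check, here’s a minimal Python sketch of this decision rule. The function name and the numbers are mine, chosen just for illustration:

```python
def expected_returns(Q, B, A=1.0):
    """Expected returns of wagering the fraction A for and against an event
    that your model says occurs with probability Q."""
    bet_for = Q * A * B - (1 - Q) * A
    bet_against = -Q * A + (1 - Q) * A / B
    return bet_for, bet_against

# With B = 3, the break-even point is at odds Q/(1-Q) = 1/3, i.e., Q = 0.25.
for Q in [0.1, 0.25, 0.4]:
    ev_for, ev_against = expected_returns(Q, B=3)
    side = "for" if Q / (1 - Q) >= 1 / 3 else "against"
    print(f"Q={Q}: E[for]={ev_for:+.3f}, E[against]={ev_against:+.3f} -> bet {side}")
```

At Q = 0.25 both sides have zero expected value; above the threshold, betting on the event wins.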
Now, if you want to maximize your expected value, you’ll put all of your money on your chosen bet. This seems, well, degenerate, but let’s see what we can learn from the degens. Certainly, a degenerate bet gives us a way to evaluate a forecaster’s probabilistic predictions. If the actual probability of the event is P, then the actual expected return on the bet is

S(Q,P) = P B - (1-P)  if Q/(1-Q) ≥ 1/B
S(Q,P) = -P + (1-P)/B  otherwise

This is the expected value with respect to P of the return of the policy π_Q that was designed using Q.
Note that a forecaster who correctly estimates the probability of the event as P will have not only a positive expected return but the maximal possible expected return. But there are many Qs that get the same maximal return.
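Here’s a small sketch of this scoring rule in Python (the function name and numbers are my own), showing how all forecasts on the correct side of the threshold tie for the maximal score:

```python
def degenerate_score(Q, P, B):
    """Expected return, under the true probability P, of going all-in on
    the side favored by the forecast Q."""
    if Q / (1 - Q) >= 1 / B:
        return P * B - (1 - P)   # all-in on the event
    return -P + (1 - P) / B      # all-in against the event

# With P = 0.5 and B = 3, the odds threshold sits at Q = 0.25. Every forecast
# above it earns the same maximal score: the rule is proper, not strictly so.
for Q in [0.2, 0.3, 0.5, 0.9]:
    print(f"Q={Q}: expected return {degenerate_score(Q, P=0.5, B=3):+.3f}")
```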
Let me summarize the situation so far. Eliciting degenerate bets yields a proper scoring rule for probability distributions. A scoring rule is a function that takes a forecaster’s distribution Q and compares it to the probabilistic nature of reality P. For every outcome Y, a scoring rule assigns a number R(Q,Y) evaluating the quality of the forecast Q when Y happens. The scoring rule is then the expected value

S(Q,P) = E_P[R(Q,Y)]
A scoring rule S(Q,P) is proper if S(Q,P) is less than or equal to S(P,P) for all Q and P. That is, the score is maximized by the “true distribution” of nature. A strictly proper scoring rule is one where S(Q,P) is strictly less than S(P,P) for all Q not equal to P. Strictly proper rules are preferable because only a forecaster who reports the true distribution achieves the optimal score.
In our degenerate gambling setup, we have derived a proper scoring rule but not a strictly proper one. Interestingly, we can get a strictly proper scoring rule by adding a bit of risk aversion to the utility maximization problem.
How should we gamble if we don’t want to be degens? A gambling strategy that dates back to Bernoulli suggests that you should maximize the rate of return over several rounds of gambling and not just try to win it all in one hand. This turns out to be equivalent to maximizing the geometric mean of your returns.
This model assumes a hypothetical infinite sequence of bets on events, each of which your model says happens with probability Q. The utility is the long-run average rate of return when you always wager the fraction A of your total assets on the event.
If you prefer to think in terms of logarithms, this utility maximization problem is equivalent to

maximize over A:  Q log(1 + A B) + (1-Q) log(1 - A)

(written here for betting on the event; the bet-against case is analogous).
The optimal solution of this problem is called the Kelly Criterion. You bet
A = Q - (1-Q)/B
on the event happening, provided that your model’s odds of the event, Q/(1-Q), are greater than or equal to 1/B. If the odds are worse than 1/B, you bet against the event with allocation
A = (1-Q) - Q B
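Here’s a sketch in Python checking the closed form against a brute-force maximization of the log utility. The function names and the sign convention (negative allocations mean betting against) are mine:

```python
import numpy as np

def kelly_fraction(Q, B):
    """Signed Kelly allocation: positive means bet on the event,
    negative means bet the absolute value against it."""
    if Q / (1 - Q) >= 1 / B:
        return Q - (1 - Q) / B
    return -((1 - Q) - Q * B)

def log_growth(A, Q, B):
    """Expected log return of a signed allocation A under the model Q."""
    if A >= 0:  # betting A on the event
        return Q * np.log(1 + A * B) + (1 - Q) * np.log(1 - A)
    # betting |A| against the event, at payout 1/B per dollar
    return Q * np.log(1 + A) + (1 - Q) * np.log(1 - A / B)

Q, B = 0.6, 2.0
grid = np.linspace(-0.99, 0.99, 20001)
best = grid[np.argmax([log_growth(a, Q, B) for a in grid])]
print(f"closed form: {kelly_fraction(Q, B):.4f}, grid search: {best:.4f}")
```

Both should land on A = 0.4 for this choice of Q and B.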
The expected value of the Kelly bet, assuming the true probability is P, has a fascinating form:

S(Q,P) = [H(P) - H(P,Q)] + S(P,P)
where H(P,Q) = -P log Q - (1-P) log(1-Q) is the cross-entropy of Q relative to P, and H(P) = H(P,P) is the entropy of P. The second term, S(P,P), is not affected by the choice of Q: it is the expected return of the policy derived from the true distribution P. Rearranging, we have

H(P) - H(P,Q) = S(Q,P) - S(P,P)
The left-hand side is the negative Kullback-Leibler divergence between P and Q. What the KL divergence means is the topic of another post. But for today, note that the KL divergence is always nonnegative and equals zero only if P=Q. That is, this particular utility maximization problem yields a strictly proper scoring rule. Moreover, the scoring rule is always morally equal to the KL divergence, no matter the relationship between the offered payout B and the probability of the event.
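If you don’t trust the algebra, here’s a quick numerical check of this identity. It’s a sketch under the setup above, with function names of my own invention:

```python
import numpy as np

def kelly_score(Q, P, B):
    """Expected log growth under the true probability P when you bet the
    Kelly fraction derived from the forecast Q."""
    if Q / (1 - Q) >= 1 / B:
        A = Q - (1 - Q) / B   # bet on the event
        return P * np.log(1 + A * B) + (1 - P) * np.log(1 - A)
    A = (1 - Q) - Q * B       # bet against the event
    return P * np.log(1 - A) + (1 - P) * np.log(1 + A / B)

def kl(P, Q):
    """KL divergence between Bernoulli(P) and Bernoulli(Q)."""
    return P * np.log(P / Q) + (1 - P) * np.log((1 - P) / (1 - Q))

P, B = 0.7, 2.0
for Q in [0.2, 0.5, 0.7, 0.9]:
    gap = kelly_score(Q, P, B) - kelly_score(P, P, B)
    print(f"Q={Q}: score gap {gap:+.6f}   -KL {-kl(P, Q):+.6f}")
```

The two columns agree for every Q, whichever side of the bet the forecast lands on.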
You might guess (and you’d be correct) that you get the Brier Score by using a different risk-aversion model. If your utility is instead regularized by a quadratic term, say

maximize over A:  Q A B - (1-Q) A - (1/2) A^2

then you get a strictly proper scoring rule whose penalty is a quadratic distance between P and Q. In the case of even payouts (B=1), the induced scoring rule is, up to scaling, the Brier Score.
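Here’s a small numerical check at even payouts. The (1/2) weight on the penalty is my choice and only changes the scaling; with B=1 the quadratic utility is maximized at the signed bet A = 2Q - 1:

```python
def quad_score(Q, P):
    """Expected quadratically penalized return at even payouts (B = 1).
    The optimal signed bet for the forecast Q is A = 2Q - 1."""
    A = 2 * Q - 1
    return A * (2 * P - 1) - A**2 / 2

def brier(Q, P):
    """Expected squared error E[(Y - Q)^2] for Y ~ Bernoulli(P)."""
    return P * (1 - Q)**2 + (1 - P) * Q**2

P = 0.7
for Q in [0.2, 0.5, 0.7, 0.95]:
    score_gap = quad_score(Q, P) - quad_score(P, P)
    brier_gap = brier(Q, P) - brier(P, P)
    print(f"Q={Q}: score gap {score_gap:+.4f}, Brier gap {brier_gap:+.4f}")
# The score gap is exactly -2 times the Brier gap: the same strictly proper rule.
```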
Though I never see these scores motivated this way in the recent literature, I find Savage’s appeal to Homo economicus instructive. In all of the papers and articles I’ve read about proper scoring rules, the promoters often argue that forecasters should not lean too heavily on prediction accuracy. They advocate instead for calibration and appropriate “distributional sharpness.” I’ve tried, but I’ve never understood why exactly these properties are the gold standard we should strive for.
The utility maximization framework is far more explicit. The scoring rule comes out of explicitly articulating our expectations of the forecasting system, and the penalty is a direct measurement of how far reality deviates from those expectations. When we declare the evaluation rules up front, we get a scoring rule for free. It’s interesting to me that in order to get a strictly proper rule, we have to add an element of risk aversion to the utility maximization.
I think risk aversion is what the statisticians are after, too. Statisticians argue that scoring rules are supposed to enable decision-makers to hedge against uncertainty. For example, when Frank Harrell eloquently describes his advocacy for scoring rules, he explicitly argues for risk aversion. But if risk aversion is the intended goal, why not be explicit about it and precisely declare the purported value of a probabilistic forecast? Value is only clear if we articulate how we will evaluate.
By Ben Recht