When is being wrong evidence that you’re right?

In 2016, FiveThirtyEight predicted that Hillary Clinton would win the US presidential election. This has made a lot of people very angry and widely been regarded as a bad move.

Now (well, a couple of weeks ago, when I started writing this post), FiveThirtyEight has published their first prediction for the 2020 election, and in an exciting [read: ominous] twist, the calculated odds (a ~70% chance of a Democratic (Biden) victory vs a ~30% chance of a Republican (Trump) victory) pretty much coincide with the last prediction for the 2016 election. This has lent the reception a certain sense of déjà vu:

Given “what then happened”, people are feeling…a little skeptical.

Of course, there are plenty of others saying the opposite (this is, after all, Twitter), and making the observation that the 2016 prediction is perfectly consistent with what actually happened (since 30% ≠ 0%). This is true, of course. But we might worry (as the second tweet above suggests) that this is a royal road to unfalsifiability: any prediction which assigns non-zero probabilities to every outcome is going to be consistent with all possible evidence, so this isn’t a very demanding standard. And certainly, the intuition being expressed by the skeptics is reasonably compelling. If your model predicts that A is more likely than B, and B then happens, surely that’s something that should make you reduce your faith in the model?

But a quick thought experiment suggests that this intuition can’t be held true universally. Let’s suppose first that, in fact, the final prediction for 2020 is the same as for 2016 (that is, that the prediction doesn’t change between now and November): a 70% chance of a Biden victory. And let’s imagine further that Biden does indeed win. On the intuition above, this is a good thing for the model, and should be taken as confirmation for it.

Now let’s imagine that the same thing happens in 2024: a 70/30 prediction, followed by a victory for the candidate who was given the 70% chance of winning. And then the same thing happens in 2028. And in 2032, 2036, 2040…in fact, let’s suppose that 100 elections later, as the horrifying biomechanical monstrosities of MechaTrump and CyberBiden face off against one another in the 2416 election, Nate Silver’s great-great-great-great-great-great-great-great-great-great-great-great-grandson unveils the prediction that CyberBiden has a 70% chance of beating MechaTrump – and, indeed, that this comes to pass. The above intuition suggests that each of these 100 events is confirmatory of the model, since in each case the model “got it right”. So when 2420 rolls around, our confidence in the model should (by the intuition) be extremely high.

But this doesn’t seem right. If the model is consistently predicting a 30% chance for the less favoured candidate, and yet the less favoured candidate has never once won, then that suggests the model is badly overestimating how close these elections are. Of course, it’s consistent with the model that the 30% candidate never wins – just as it’s consistent to suppose that a die is fair, even if 100 rolls fail to ever deliver a number lower than 3. But as noted above, consistency is a very weak standard in the context of probabilistic predictions; rolling 3 or more 100 times is evidence against the fairness of the die, and the perfect occurrence of events with “only” a 70% probability is evidence against the correctness of the model.
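To get a feel for how weak "consistent with" is here, the two long-run probabilities can be computed directly. This is a minimal sketch that assumes the rolls, and likewise the elections, are independent of one another (an assumption added for the illustration, not something the thought experiment states).

```python
# Probability that a fair six-sided die shows 3 or more on 100 rolls in a row.
# Four of the six faces (3, 4, 5, 6) qualify on each roll.
p_die = (4 / 6) ** 100

# Probability that a 70% favourite wins 100 elections in a row, assuming the
# 70% figure is exactly right every time and the elections are independent.
p_model = 0.7 ** 100

print(f"fair die, 100 rolls of 3 or more: {p_die:.2e}")
print(f"70% favourite wins 100 times:    {p_model:.2e}")
```

Both numbers are positive, so neither run is ruled out; both are also astronomically small, which is why the run counts as evidence against the hypothesis.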

(Another way to put this is to link it to betting behaviour. Suppose that the New York Times model has – as is its wont – consistently assigned a 90% chance to the candidate that FiveThirtyEight assigns 70% to, and a 10% chance to the other. Then someone who was letting that model guide their betting behaviour would have done much better, over the course of the 400 years, than someone following the FiveThirtyEight model. For example, the NYT-follower would be consistently willing to offer the FiveThirtyEight-follower a bet that costs $1, and pays $4 if the less favoured candidate wins; and the FiveThirtyEight-follower will be consistently willing to accept such a bet. (Since the FiveThirtyEight-follower considers this bet to have a value of 4×0.3 = $1.20, so worth the dollar, whilst the NYT-follower considers it to have a value of 4×0.1 = $0.40.) If they make such a bet every election, the NYT-follower ends up making a tidy profit. Again, this doesn’t show conclusively that the FiveThirtyEight-follower wasn’t just very unlucky; but it’s evidence for their credences being inapt.)
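The arithmetic of that bet can be sketched in a few lines of Python. The names `STAKE`, `PAYOUT`, `buyer_expected_value` and `seller_profit` are made up for this illustration; the numbers are the ones in the paragraph above.

```python
STAKE = 1.0   # what the buyer pays for the bet, in dollars
PAYOUT = 4.0  # what the buyer receives if the less favoured candidate wins


def buyer_expected_value(p_underdog):
    """Expected net gain per bet for the buyer, given their probability of an upset."""
    return PAYOUT * p_underdog - STAKE


def seller_profit(upsets):
    """Seller's realised profit over a sequence of elections.

    `upsets` is a list of booleans, True where the less favoured candidate won.
    """
    return sum(STAKE - (PAYOUT if upset else 0.0) for upset in upsets)


# The FiveThirtyEight-follower (buyer) thinks each bet is worth about +$0.20;
# by the NYT-follower's lights the buyer is losing about $0.60 a time.
print(buyer_expected_value(0.3), buyer_expected_value(0.1))

# 100 elections in which the favourite always wins: the seller keeps every stake.
print(seller_profit([False] * 100))
```

Over the 100 favourite-wins elections of the thought experiment, the NYT-follower is $100 up, whereas the FiveThirtyEight-follower had expected to come out about $20 ahead.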

And in fact, we can make this kind of intuition precise, by throwing a (pretty simplistic) Bayesian analysis at it. Just as a reminder, the Bayesian thinks that if your initial degrees of belief over some collection of propositions form a probability distribution P, then after receiving evidence E, your degrees of belief should now form a probability distribution P' such that for any proposition H, P'(H) = P(H|E) – that is, the probability (according to P) of H given E. By Bayes’ theorem, this is

P(H|E) = \frac{P(H) P(E|H)}{P(E)}

(Intuitively: your belief in the hypothesis H after receiving evidence E will be higher insofar as you already thought H was true, and insofar as (you thought) H made E likely to be true; and it will be lower to the extent that E was likely to be true independently of H.)

For example, this is going to say that if in 2016 we had some initial credence (i.e. degree of belief) C that the FiveThirtyEight model was correct, and credence T that Trump would win, then following the election (given that the probability of a Trump victory, assuming the correctness of the FiveThirtyEight model, was 0.3) we should have credence C' that the model was correct, where

C' = \frac{C \times 0.3}{T}

Incidentally, note that whether C' > C or C' < C (or C' = C) depends, in part, on T – that is, on how likely you took a Trump victory to be, all things considered: C' < C if and only if T > 0.3. So if someone had a 50-50 credence between Trump and Clinton, then following the election their credence in the FiveThirtyEight model would be down to 60% of whatever credence in it they had previously; but if they had only a 0.2 credence in a Trump victory, then their post-election credence in FiveThirtyEight would be one and a half times what it was before.
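As a quick check on those two cases, here is the update written as a tiny function. The prior credence of 0.4 in the model is a made-up value chosen purely for illustration; only the ratio between the new and old credence matters.

```python
def updated_credence(c, t, likelihood=0.3):
    """Posterior credence in the model after a Trump win.

    c: prior credence that the model is correct
    t: prior credence in a Trump victory, all things considered
    likelihood: probability of a Trump victory according to the model
    """
    return c * likelihood / t


c = 0.4  # hypothetical prior credence in the FiveThirtyEight model

for t in (0.5, 0.3, 0.2):
    c_new = updated_credence(c, t)
    print(f"T = {t}: C' = {c_new:.2f}, which is {c_new / c:.2f} times C")
```

With T = 0.5 the credence shrinks to 0.6 of its old value, with T = 0.3 it is unchanged, and with T = 0.2 it grows by a factor of 1.5.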

This might already seem a bit weird: shouldn’t the 2016 result just be evidence against the FiveThirtyEight model, independently of what else someone thought? How could someone end up taking that result as evidence in favour of the model? To see how that might happen, suppose that before the election I entirely divide my credences between the FiveThirtyEight and NYT models: that is, I give a 0.5 credence to each of them. Then my credence T in a Trump victory comes out as the average of the two models, i.e. at 0.2. Following the election, my updated credences are 0.25 for the NYT model and 0.75 for the FiveThirtyEight model – which makes sense, given that although the FiveThirtyEight model was wrong, at least it was less wrong than the NYT model. More generally, if I had been more sceptical about a Trump victory than FiveThirtyEight was, then the result is evidence in favour of FiveThirtyEight in at least the minimal sense that it did better than me.
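The two-model case is small enough to check by hand, but a short sketch makes the mechanics explicit. The function name `bayes_update` and the dictionary layout are assumptions made for this illustration.

```python
def bayes_update(prior, likelihood):
    """Return (posterior, evidence).

    prior: credence in each model
    likelihood: each model's probability for the outcome that actually occurred
    evidence: the overall prior probability of that outcome
    """
    evidence = sum(prior[m] * likelihood[m] for m in prior)
    posterior = {m: prior[m] * likelihood[m] / evidence for m in prior}
    return posterior, evidence


prior = {"FiveThirtyEight": 0.5, "NYT": 0.5}
p_trump = {"FiveThirtyEight": 0.3, "NYT": 0.1}  # each model's chance of a Trump win

posterior, t = bayes_update(prior, p_trump)
print(f"T = {t:.2f}")
for model, credence in posterior.items():
    print(f"{model}: {credence:.2f}")
```

The evidence term comes out at 0.2, and the posterior credences at 0.75 and 0.25, matching the figures above.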

This example also points to something that we’ll need to be a bit careful about. For the sake of consistency, it’ll be best to imagine that we only have credences over models, which generate probabilistic predictions over election outcomes, rather than having credences over outcomes directly. For example, it wouldn’t be consistent to have a 0.8 credence in the FiveThirtyEight model but only a 0.2 credence in a Trump victory: if I have 0.8 faith in the FiveThirtyEight model, and that model says there’s a 0.3 chance of a Trump victory, then I should have at least a 0.8 × 0.3 = 0.24 credence in a Trump victory.
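Spelled out, that lower bound is just the law of total probability. Writing P(M) for my credence in a model M, P(Trump | M) for that model's forecast, and M_{538} for the FiveThirtyEight model (notation assumed here for the illustration), with the sum running over every model I give any credence to:

```latex
T = \sum_{M} P(M)\, P(\text{Trump} \mid M)
  \;\geq\; P(M_{538})\, P(\text{Trump} \mid M_{538})
  = 0.8 \times 0.3 = 0.24
```

The inequality holds because every other term in the sum is non-negative.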

We’re now in a position to play out the thought experiment above in precise terms. Let’s imagine nine models M_i, where the model M_i (always) assigns an i/10 chance to a Democrat victory: so the model M_7 is the FiveThirtyEight model, and the model M_9 is the NYT model. Let’s also suppose that I start with a uniformly distributed credence over these models, so I assign each of them a 1/9 credence; as a result, my initial credence in a Democratic victory is 0.5 (the average prediction over these models). And now let’s imagine that at each time step, the Democrats win: how does my credence distribution over the models change?

The answer is plotted out in this chart – I’ve only included the first 20 time steps (i.e. the first 80 years), since that shows most of the behaviour we’re interested in. P is the (time-evolving) credence function: so the blue line at the top, for instance, is my credence in the model M_9.

Comparing the various models, we can see that my credence in any model which assigns the Democrats a chance of 0.5 or less decays away pretty rapidly. M_9, as the model which is consistently “closest” to the outcome, continuously cannibalises credence from the others, although the growth rate slows down once it’s cemented its lead. But it’s the ones in between that are of interest to us, since they capture the observation that we started with at the beginning of this post: that being right all the time, for a probabilistic model, isn’t necessarily good news. As we can see, those models initially gain in credence as the Democratic victories come in – but as things go on and we see more victories than those models expect, the credence in them starts to die away.
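The behaviour described here can be reproduced with a short simulation. The following is a minimal sketch, not the code behind the chart; the function name `update` and the list layout are assumptions made for the illustration.

```python
def update(credences, democrat_won):
    """One Bayesian update of the credences over the models M_1..M_9.

    credences[i - 1] is the credence in M_i, which gives the Democrats
    a chance of i/10 in every election.
    """
    likelihoods = [
        (i / 10 if democrat_won else 1 - i / 10) for i in range(1, 10)
    ]
    weighted = [c * l for c, l in zip(credences, likelihoods)]
    total = sum(weighted)  # prior probability of the observed outcome
    return [w / total for w in weighted]


credences = [1 / 9] * 9  # uniform prior over the nine models
history = [credences]    # history[n] holds the credences after n elections
for _ in range(20):
    credences = update(credences, democrat_won=True)
    history.append(credences)


def peak_step(i):
    """Number of Democratic wins after which credence in M_i is highest."""
    return max(range(len(history)), key=lambda n: history[n][i - 1])


for i in (7, 8, 9):
    print(f"M_{i}: peaks after {peak_step(i)} wins, "
          f"ends at {history[-1][i - 1]:.3f}")
```

On this setup, credence in M_7 peaks after two Democratic wins and credence in M_8 after five, while credence in M_9 is still climbing at step 20.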

Just for fun, let’s finish up by supposing that on the 21st time-step, the Democrats’ hegemony is broken by a surprise Republican victory. That means the credence-functions look like this:

Unsurprisingly, M_9 gets dethroned a bit – and the other models all get a boost in return. So these would be circumstances in which a model like M_8 (which, remember, says that there’s an 80% chance of a Democratic victory) would benefit more from a Republican victory than a Democratic one. So sometimes, evidence that a model considers somewhat (but not totally) unlikely is better evidence for that model than more of the evidence that it considers somewhat (but not totally) likely.
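To see the comparison for M_8 concretely, the sketch below runs 20 Democratic wins and then branches on the 21st election. As before, the function name `update` and the list layout are assumptions made for the illustration.

```python
def update(credences, democrat_won):
    """One Bayesian update over the models M_1..M_9 (M_i says i/10 for the Democrats)."""
    likelihoods = [
        (i / 10 if democrat_won else 1 - i / 10) for i in range(1, 10)
    ]
    weighted = [c * l for c, l in zip(credences, likelihoods)]
    total = sum(weighted)
    return [w / total for w in weighted]


credences = [1 / 9] * 9
for _ in range(20):  # twenty Democratic victories in a row
    credences = update(credences, democrat_won=True)

after_republican = update(credences, democrat_won=False)
after_democrat = update(credences, democrat_won=True)

for i in (8, 9):
    print(f"M_{i}: before {credences[i - 1]:.3f}, "
          f"after Republican win {after_republican[i - 1]:.3f}, "
          f"after Democratic win {after_democrat[i - 1]:.3f}")
```

In this run, M_8's credence rises after the Republican win and would have fallen after a 21st Democratic win, while M_9 moves the opposite way in both branches.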
