When is being wrong evidence that you’re right?

In 2016, FiveThirtyEight predicted that Hillary Clinton would win the US presidential election, giving her roughly a 70% chance of victory. This has made a lot of people very angry and been widely regarded as a bad move.

Now (well, a couple of weeks ago, when I started writing this post), FiveThirtyEight published their first prediction for the 2020 election, and in an exciting [read: ominous] twist, the calculated odds – a ~70% chance of a Democratic (Biden) victory vs a ~30% chance of a Republican (Trump) victory – pretty much coincided with the final prediction for the 2016 election. This has lent a certain sense of déjà vu to the reception.

Given “what then happened”, people are feeling…a little skeptical.

Of course, there are plenty of others saying the opposite (this is, after all, Twitter), observing that the 2016 prediction is perfectly consistent with what actually happened (since 30% ≠ 0%). This is true, of course. But we might worry (as some of the skeptics suggest) that this is a royal road to unfalsifiability: any prediction which assigns a non-zero probability to every outcome is going to be consistent with all possible evidence, so this isn’t a very demanding standard. And certainly, the intuition being expressed by the skeptics is reasonably compelling. If your model predicts that A is more likely than B, and B then happens, surely that should reduce your faith in the model?

But a quick thought experiment suggests that this intuition can’t hold universally. Let’s suppose first that the final prediction for 2020 turns out to be the same as for 2016 (that is, that the prediction doesn’t change between now and November): a 70% chance of a Biden victory. And let’s imagine further that Biden does indeed win. On the intuition above, this is a good thing for the model, and should be taken as confirmation of it.

Now let’s imagine that the same thing happens in 2024: a 70/30 prediction, followed by a victory for the candidate who was given the 70% chance of winning. And then the same thing happens in 2028. And in 2032, 2036, 2040…in fact, let’s suppose that 100 elections later, as the horrifying biomechanical monstrosities of MechaTrump and CyberBiden face off against one another in the 2416 election, Nate Silver’s great-great-great-great-great-great-great-great-great-great-great-great-grandson unveils the prediction that CyberBiden has a 70% chance of beating MechaTrump – and, indeed, that this comes to pass. The above intuition suggests that each of these 100 events is confirmatory of the model, since in each case the model “got it right”. So when 2420 rolls around, our confidence in the model should (by the intuition) be extremely high.

But this doesn’t seem right. If the model is consistently predicting a 30% chance for the less favoured candidate, and yet the less favoured candidate has never once won, then that suggests the model is badly overestimating how close these elections are. Of course, it’s consistent with the model that the 30% candidate never wins – just as it’s consistent to suppose that a die is fair, even if 100 rolls fail to ever deliver a number lower than 3. But as noted above, consistency is a very weak standard in the context of probabilistic predictions; rolling 3 or more 100 times in a row is evidence against the fairness of the die, and an unbroken run of events with “only” a 70% probability is evidence against the correctness of the model.
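
To put rough numbers on that, here’s a quick Python check (just arithmetic on the probabilities stated above) of how improbable a perfect 100-for-100 run is under each hypothesis:

```python
# Probability of a perfect run, if each event is independent:
p_die = (4 / 6) ** 100   # 100 rolls of a fair die, all landing on 3 or higher
p_model = 0.7 ** 100     # the 70% candidate winning 100 elections straight

print(f"{p_die:.2e}")    # ~2.5e-18
print(f"{p_model:.2e}")  # ~3.2e-16
```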

(Another way to put this is to link it to betting behaviour. Suppose that the New York Times model has – as is its wont – consistently assigned a 90% chance to the candidate that FiveThirtyEight assigns 70% to, and a 10% chance to the other. Then someone letting that model guide their betting behaviour would have done much better, over the course of the 400 years, than someone following the FiveThirtyEight model. For example, the NYT-follower would be consistently willing to offer the FiveThirtyEight-follower a bet that costs $1, and pays $4 if the less favoured candidate wins; and the FiveThirtyEight-follower would be consistently willing to accept such a bet. (The FiveThirtyEight-follower considers this bet to have an expected value of 4 × 0.3 = $1.20, so worth the dollar, whilst the NYT-follower considers it to have an expected value of 4 × 0.1 = $0.40.) If they make such a bet every election, the NYT-follower ends up making a tidy profit. Again, this doesn’t show conclusively that the FiveThirtyEight-follower wasn’t just very unlucky; but it’s evidence that their credences were inapt.)
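
As a minimal sketch of that betting logic (assuming the $1 stake and $4 payout stipulated above):

```python
payout, cost = 4.0, 1.0  # the bet pays $4 and costs $1

# Each follower values the bet by their model's probability that the
# less favoured candidate wins:
for label, prob in [("FiveThirtyEight-follower", 0.3), ("NYT-follower", 0.1)]:
    print(f"{label} values the bet at ${payout * prob:.2f}")
# $1.20 for the FiveThirtyEight-follower (so they buy at $1);
# $0.40 for the NYT-follower (so they happily sell at $1).

# If the less favoured candidate never wins across 100 elections,
# the seller simply pockets the stake every time:
print(f"NYT-follower's profit: ${100 * cost:.0f}")
```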

And in fact, we can make this kind of intuition precise, by throwing a (pretty simplistic) Bayesian analysis at it. Just as a reminder, the Bayesian thinks that if your initial degrees of belief over some collection of propositions form a probability distribution P, then after receiving evidence E, your degrees of belief should now form a probability distribution P' such that for any proposition H, P'(H) = P(H|E) – that is, the probability (according to P) of H given E. By Bayes’ theorem, this is

P(H|E) = \frac{P(H) P(E|H)}{P(E)}

(Intuitively: your belief in the hypothesis H after receiving evidence E will be higher insofar as you already thought H was true, and insofar as (you thought) H made E likely to be true; and it will be lower to the extent that E was likely to be true independently of H.)

For example, this is going to say that if in 2016 we had some initial credence (i.e. degree of belief) C that the FiveThirtyEight model was correct, and credence T that Trump would win, then following the election (given that the probability of a Trump victory, assuming the correctness of the FiveThirtyEight model, was 0.3) we should have credence C' that the model was correct, where

C' = \frac{C \times 0.3}{T}

Incidentally, note that whether C' > C or C' < C (or C' = C) depends, in part, on T – that is, on how likely you took a Trump victory to be, all things considered: C' < C if and only if T > 0.3. So if someone had a 50-50 credence between Trump and Clinton, then following the election their credence in the FiveThirtyEight model would be down to 60% of whatever credence they had in it previously; but if they had only a 0.2 credence in a Trump victory, then their post-election credence in FiveThirtyEight would be one and a half times what it was before.

This might already seem a bit weird: shouldn’t the 2016 result just be evidence against the FiveThirtyEight model, independently of what else someone thought? How could someone end up taking that result as evidence in favour of the model? To see how that might happen, suppose that before the election I entirely divide my credences between the FiveThirtyEight and NYT models: that is, I give a 0.5 credence to each of them. Then my credence T in a Trump victory comes out as the average of the two models, i.e. at 0.2. Following the election, my updated credences are 0.25 for the NYT model and 0.75 for the FiveThirtyEight model – which makes sense, given that although the FiveThirtyEight model was wrong, at least it was less wrong than the NYT model. More generally, if I had been more sceptical about a Trump victory than FiveThirtyEight was, then the result is evidence in favour of FiveThirtyEight in at least the minimal sense that it did better than me.
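
Here’s a short Python sketch that just mechanises that two-model update via Bayes’ theorem:

```python
# Prior credence, split evenly between the two models
priors = {"FiveThirtyEight": 0.5, "NYT": 0.5}
# Each model's probability of a Trump victory
p_trump = {"FiveThirtyEight": 0.3, "NYT": 0.1}

# My all-things-considered credence T in a Trump victory is the
# mixture of the models' predictions, weighted by my credences:
T = sum(priors[m] * p_trump[m] for m in priors)  # 0.2

# Bayes' theorem: P'(M) = P(M) * P(Trump win | M) / T
posteriors = {m: priors[m] * p_trump[m] / T for m in priors}
print(posteriors)  # {'FiveThirtyEight': 0.75, 'NYT': 0.25}
```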

This example also points to something that we’ll need to be a bit careful about. For the sake of consistency, it’ll be best to imagine that we only have credences over models, which generate probabilistic predictions over election outcomes, rather than having credences over outcomes directly. For example, it wouldn’t be consistent to have a 0.8 credence in the FiveThirtyEight model but only a 0.2 credence in a Trump victory: if I place 0.8 credence in the FiveThirtyEight model, and that model says there’s a 0.3 chance of a Trump victory, then I should have at least a 0.8 × 0.3 = 0.24 credence in a Trump victory.

We’re now in a position to play out the thought experiment above in precise terms. Let’s imagine nine models M_i, where the model M_i (always) assigns an i/10 chance to a Democratic victory: so the model M_7 is the FiveThirtyEight model, and the model M_9 is the NYT model. Let’s also suppose that I start with a uniformly distributed credence over these models, so I assign each of them a 1/9 credence; as a result, my initial credence in a Democratic victory is 0.5 (the average prediction over these models). And now let’s imagine that at each time step, the Democrats win: how does my credence distribution over the models change?

The answer is plotted out in this chart – I’ve only included the first 20 time steps (i.e. the first 80 years), since that shows most of the behaviour we’re interested in. P is the (time-evolving) credence function: so the blue line at the top, for instance, is my credence in the model M_9.
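
For anyone who wants to reproduce the numbers behind the chart, here’s a minimal sketch of the computation in Python (plotting omitted):

```python
import numpy as np

p_dem = np.arange(1, 10) / 10   # model M_i assigns i/10 to a Democratic win
credence = np.full(9, 1 / 9)    # uniform prior credence over M_1, ..., M_9

def update(credence, dem_wins):
    """Bayesian update of the credences on one election result."""
    likelihood = p_dem if dem_wins else 1 - p_dem
    posterior = credence * likelihood
    return posterior / posterior.sum()

for _ in range(20):             # 20 consecutive Democratic victories
    credence = update(credence, dem_wins=True)

print(np.round(credence, 3))    # M_9 dominates; even M_7 has mostly decayed
```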

Comparing the various models, we can see that my credence in any model which assigns the Democrats a chance of 0.5 or less decays away pretty rapidly. M_9, as the model which is consistently “closest” to the outcome, continuously cannibalises credence from the others, although the growth rate slows down once it’s cemented its lead. But it’s the ones in between that are of interest to us, since they capture the observation we started with at the beginning of this post: that being right all the time, for a probabilistic model, isn’t necessarily good news. As we can see, those models initially gain credence as the Democratic victories come in – but as things go on and we see more victories than those models expect, the credence in them starts to die away.

Just for fun, let’s finish up by supposing that on the 21st time-step, the Democrats’ hegemony is broken by a surprise Republican victory. That means the credence-functions look like this:

Unsurprisingly, M_9 gets dethroned a bit – and the other models all get a boost in return. So these would be circumstances in which a model like M_8 (which, remember, says that there’s an 80% chance of a Democratic victory) would benefit more from a Republican victory than a Democratic one. So sometimes, evidence that a model considers somewhat (but not totally) unlikely is better evidence for that model than more of the evidence that it considers somewhat (but not totally) likely.
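
In terms of the sketch above, this is just one more Bayesian update, now on a Republican win:

```python
# One further update on the 21st time step: a surprise Republican victory
credence = update(credence, dem_wins=False)
print(np.round(credence, 3))  # M_9 drops back; M_8 and below get a boost
```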

Changes of variables

I have a confession to make: I am confused about changes of variables.

Now, don’t get me wrong. I know how to change variables in a given problem. But (as is often the trigger for me realising I don’t have as good a grasp on something as I’d like) I’m trying to write up teaching notes on changes of variables, in a precise and principled way, and keep finding myself bogged down. So this post will be an effort to, as it were, think aloud through the issues that come up.

Let’s start with a nice easy example. Suppose that we’re thinking about the equation of the unit circle, in x-, y-coordinates:

x^2 + y^2 = 1

And now suppose that we want to convert to two-dimensional polar coordinates (r, \theta):

r = \sqrt{x^2 + y^2}

\theta = \tan^{-1}(\frac{y}{x})

Now, if we want to know the equation of the unit circle in these new coordinates, we substitute r and \theta into the above equation, using the inverse expressions to those above:

x = r \cos \theta

y = r \sin \theta

Which gets us the expected result: r = 1. In this case, the use of the inverse expressions maybe doesn’t seem totally necessary, since the original equation depends on x and y only via x^2 + y^2; but it should hopefully be clear that that’s a peculiarity of this simple example.
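
If you want to check this kind of substitution mechanically, a computer algebra system will happily do it; here’s a minimal sketch using sympy:

```python
from sympy import symbols, cos, sin, trigsimp

x, y, r, theta = symbols('x y r theta', positive=True)

# The unit circle, written as an expression set equal to zero
circle = x**2 + y**2 - 1

# Substitute the inverse expressions x = r*cos(theta), y = r*sin(theta)
in_polar = circle.subs({x: r * cos(theta), y: r * sin(theta)})

print(trigsimp(in_polar))  # r**2 - 1; with r >= 0, this gives r = 1
```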

Now, I want to claim that even this relatively simple case throws up a few potentially confusing things. For starters, we might be a little perturbed by the fact that the thing we actually use to convert our first equation (in terms of x and y) into our second (in terms of r and \theta) isn’t the definition of the new coordinates in terms of the old, but the inverse expressions that define the old coordinates in terms of the new. More than that, what – when you get down to it – is an expression like x = r \cos \theta supposed to mean? In particular, is this the defining equation for a real-valued function of r and \theta, or a recipe licensing the replacement of the simple expression “x” by the complex expression “r \cos \theta“? Or, if it’s both at once, then how do these two readings intertwine?

Let’s take a bit of a step back, then. We started with an equation, which may be regarded as a certain kind of syntactic object, formulated in a particular mathematical language: one featuring, in particular, the variables x and y, in addition to the usual algebraic symbols (+, etc.). We’ll think of this as analogous to a formula formed in a first-order or propositional language.

On the semantic side, this equation is to be interpreted over \mathbb{R}^2. We’re helping ourselves to the idea that this structure has been parameterised by x and y; that is to say, an element of \mathbb{R}^2 is really a map p from \{\ulcorner x \urcorner, \ulcorner y \urcorner\} to \mathbb{R}; I’m using corner-quotes here to stress that it’s the symbols that are being mapped to points of \mathbb{R}. (This is supposed to resemble the manner in which an interpretation of a logical formula is a map from the set of propositional or predicate symbols to truth-values or sets of tuples.) Such a point satisfies the equation if p(\ulcorner x \urcorner)^2 + p(\ulcorner y \urcorner)^2 = 1. A solution to this equation is a set of points in \mathbb{R}^2, such that each point in the set satisfies this equation.

All this might seem like a massive palaver to go through, just in order to state the elementary fact that a solution to the equation x^2 + y^2 = 1 is a set of points (x, y) for which the equation is true. And of course, that’s certainly true; but the hope is that by being as careful as possible, and in particular by distinguishing starkly between syntactic and semantic considerations, we can figure out what’s going on when we start working with new variables.

(Incidentally, as a side-note to the above: if we do want to think of the equation as analogous to a first-order formula like \forall x Px, there’s an interesting question about what the proper analogue of a first-order model of that formula ought to be. Is it an individual point in \mathbb{R}^2 satisfying the equation, or a solution to the equation?)

On the other side of the divide, we have an equation in a different language – namely, one having the same algebraic and operational symbols, etc., but having instead the variables r and \theta. This equation just reads r = 1. (Let’s forget, for the moment, that we obtained this equation from the previous one through some substitution work.) On the semantic side, this language is also interpreted over points of \mathbb{R}^2, as parameterised by r and \theta: that is, it is interpreted by maps from \{\ulcorner r \urcorner, \ulcorner \theta \urcorner\} to \mathbb{R}. We also include the specifications (i) that we’ll only consider maps p such that p(\ulcorner r \urcorner) \geq 0, and (ii) that \theta is periodic, so that if p(\ulcorner r \urcorner) = q (\ulcorner r \urcorner) and p(\ulcorner \theta \urcorner) = q (\ulcorner \theta \urcorner) + 2\pi, then p = q. (Which means that we’re not really interpreting it over \mathbb{R}^2 after all, but rather over the half-cylinder obtained by gluing the \theta = 0 edge of the strip \{(r, \theta) : r \geq 0, 0 \leq \theta \leq 2\pi\} \subset \mathbb{R}^2 to the \theta = 2\pi edge.) So analogously to what we had before, a point p satisfies the equation if p(\ulcorner r \urcorner) = 1.

Now, there is meant to be some sense in which these two equations “have the same content”. What sense is that? Well, here are two relevant observations. First, at the syntactic level: uniformly substituting \ulcorner r \cos \theta \urcorner for \ulcorner x \urcorner and \ulcorner r \sin \theta \urcorner for \ulcorner y \urcorner in the former equation yields (an equation provably equivalent to) the latter, and uniformly substituting \ulcorner \sqrt{x^2 + y^2} \urcorner for \ulcorner r \urcorner (and \ulcorner\tan^{-1}(y/x)\urcorner for \ulcorner \theta \urcorner) in the latter equation yields (an equation provably equivalent to) the former. Moreover, applying these two substitutions to \ulcorner x \urcorner in turn yields an expression (namely, \ulcorner \sqrt{x^2 + y^2} \cos(\tan^{-1}(y/x)) \urcorner) that is provably equivalent to \ulcorner x \urcorner; the same goes for all the other variables. In other words, we’ve got a pair of mutually inverse “translations” between the two equations.
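
That round-trip claim can also be checked mechanically – e.g. in sympy again, restricting attention to the positive quadrant so that the inverse-trig identities hold without case distinctions:

```python
from sympy import symbols, sqrt, atan, cos, simplify

x, y = symbols('x y', positive=True)  # restrict to the positive quadrant

# Apply the two substitutions to x in turn:
#   x -> r*cos(theta), then r -> sqrt(x^2 + y^2), theta -> atan(y/x)
round_trip = sqrt(x**2 + y**2) * cos(atan(y / x))

print(simplify(round_trip))  # x: the round trip is provably the identity
```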

At the semantic level, suppose I fix the following map F from \mathbb{R}^2, parameterised by x and y, to the half-cylinder constructed from \mathbb{R}^2 and parameterised by r and \theta: given any p: \{\ulcorner x \urcorner, \ulcorner y \urcorner\} \to \mathbb{R}, F(p) : \{\ulcorner r \urcorner, \ulcorner \theta \urcorner\} \to \mathbb{R} according to

F(p) : \ulcorner r \urcorner \mapsto \sqrt{ p(\ulcorner x \urcorner)^2 + p(\ulcorner y \urcorner)^2}

F(p) : \ulcorner \theta \urcorner \mapsto \tan^{-1}(p(\ulcorner y \urcorner)/ p(\ulcorner x \urcorner))

This map then has the following feature: given any point p \in \mathbb{R}^2, if p satisfies the first equation then F(p) satisfies the second equation. Similarly, we can construct a map G from the half-cylinder to \mathbb{R}^2, which will have the feature that for any point q of the former, if q satisfies the second equation then G(q) satisfies the first equation. And moreover, a little work will show that F and G are inverse to one another (well, other than at places like r = 0).
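
To make this concrete, here’s a small Python sketch of F and G, with points represented (as above) as maps from variable symbols to reals – i.e. as dicts. One wrinkle: the quadrant-correct version of \tan^{-1}(y/x) is atan2.

```python
import math

def F(p):
    """(x, y)-point to (r, theta)-point, per the definitions above."""
    x, y = p['x'], p['y']
    return {'r': math.hypot(x, y),
            'theta': math.atan2(y, x) % (2 * math.pi)}

def G(q):
    """(r, theta)-point to (x, y)-point, via the inverse expressions."""
    r, theta = q['r'], q['theta']
    return {'x': r * math.cos(theta), 'y': r * math.sin(theta)}

p = {'x': 0.6, 'y': 0.8}  # a point satisfying x^2 + y^2 = 1
q = F(p)
print(q['r'])   # 1.0: F(p) satisfies r = 1
print(G(q))     # recovers p up to floating point (away from r = 0)
```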

The reason this is interesting is that it mirrors what happens with translations in formal languages: there too, one has both syntactic translations between expressions and semantic maps between interpretational structures. Moreover, it draws out one of the things that makes changes of variables so confusing (at least to me!): I started out by saying that an expression like x = r \cos \theta seems to be both a specification of a function, and a substitution manual. We’ve now got a more precise grip on what is meant by this – these two readings correspond to the semantic and syntactic maps just described. But note that (as, again, happens with formal languages) the two maps associated to an expression like x = r \cos \theta “go in opposite directions”, i.e. are dual concepts. The syntactic side of the coin is a map from expressions in the (x, y)-language to expressions in the (r, \theta)-language; the semantic side of the coin, however, is a map from points of the (r, \theta)-structure (the half-cylinder) to points of the (x, y)-structure. (That is, the semantic map is half of the instructions needed to construct G above; if I’d thought this through more carefully I’d have explicitly constructed G rather than F, but the point stands – in the construction of F we used the expressions r = \sqrt{x^2 + y^2} and \theta = \tan^{-1}(y/x).)

In any event, all of this was supposed to just be a warmup to the really tricky case – doing all this for differential equations, not just algebraic equations. Still, writing this has helped me unpick some of the tangle (admittedly, which might well be a tangle I have entirely created for myself); extending this to differential equations will have to wait for another day.