Theorem of the Week: the Łoś-Tarski preservation theorem

(This may or may not become a regular feature; at the moment, the “theorem of the week” title is more aspirational than anything else…)

So, this week I came across a nice theorem that I imagine is well known to model theory people, but which seems to me to have some potentially cool philosophy of science implications. The theorem in question is the Łoś-Tarski preservation theorem (not to be confused with what’s usually called “Łoś’s theorem”, which is the Łoś ultraproduct theorem – although that’s certainly a contender to be a future Theorem of the Week). Informally, the theorem says that a set of first-order formulas is preserved under the taking of substructures (over the models of a fixed theory T) if and only if all the formulas in it are universal (modulo T). So first, let’s spell out both halves of this – starting with the right-hand side, which turns out to be the simpler one.

So, on the one hand, a formula is universal if it’s constructed purely out of quantifier-free formulas using conjunction, disjunction, and universal quantification. To say that \Phi is equivalent modulo T to a set of universal formulas means that there’s a set \Psi of universal formulas such that for any model A of T, a sequence (a_1, a_2, \dots) satisfies all the formulas in \Phi iff it satisfies all the formulas in \Psi.

On the other hand, a substructure of a first-order structure A is a structure whose domain is a subset of A’s, and whose extensions are obtained by restricting the extensions of A to that subset. More formally, B is a substructure of A iff \textrm{dom}(B) \subseteq \textrm{dom}(A) and for any predicate P in the signature, P^{B} = P^A|_{\textrm{dom}(B)}. Then, to say that a set of formulas \Phi is preserved under substructures over the models of T means the following: given any models A, B of T such that B is a substructure of A, for any (possibly infinite) sequence (b_1, b_2, \dotsc) of elements of B, if (b_1, b_2, \dotsc) satisfies all the formulas in \Phi in A, then it satisfies all the formulas in \Phi in B.

In other words, preservation under substructures means that any time a bunch of formulas are all true of some individuals in a given model of T, then they will also all be true of those individuals in any submodel of that model (provided that the individuals are included in the submodel). The simplest case is where \Phi only contains sentences, as then it just boils down to the condition that if they’re all true of a given model, they’re also all true of any submodel. In sum, then, there’s a sense in which this property means that the set of formulas \Phi exhibits a kind of mereological harmony: it’s true of the whole only if it’s true of the parts.

To see how this condition could fail to hold, think of an existentially quantified statement: “there be dragons”, say (\exists x Dx). Just because there are dragons somewhere in a model, it doesn’t follow that any submodel has to contain dragons. So naively, we can anticipate that having existential quantifiers around might problematise preservation under substructures. On the other hand, it seems pretty natural to think that universal quantifiers aren’t a problem. If something’s true of everything in a model, then we’d expect it to still be true of everything in every substructure (since the everything in a substructure is just a part of the everything in the whole structure). More precisely, we’d expect that if a set \Phi of formulas is equivalent modulo T to a set of universal formulas, then the set \Phi will be preserved under taking substructures on models of T—which is, of course, the right-to-left half of the Łoś-Tarski theorem. And, indeed, this turns out to be reasonably easy to prove.
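
Just to make the contrast concrete, here is a toy illustration in Python (the two-predicate signature, the particular elements, and the function names are all made up for the example, and this is of course no part of the theorem or its proof):

```python
# A tiny relational model A with domain {1, 2, 3}, a predicate D ("is a
# dragon") and a predicate F ("breathes fire"), plus a substructure B obtained
# by restricting everything to the sub-domain {1, 2}.

def exists_dragon(domain, D):
    """The existential sentence  \exists x Dx."""
    return any(a in D for a in domain)

def dragons_breathe_fire(domain, D, F):
    """The universal sentence  \forall x (Dx -> Fx)."""
    return all((a not in D) or (a in F) for a in domain)

dom_A, D_A, F_A = {1, 2, 3}, {3}, {3}       # the "big" model A
dom_B = {1, 2}                               # the substructure's domain
D_B, F_B = D_A & dom_B, F_A & dom_B          # predicates restricted to it

# The existential sentence is true in A but false in B:
print(exists_dragon(dom_A, D_A), exists_dragon(dom_B, D_B))        # True False

# The universal sentence, true in A, remains true in B:
print(dragons_breathe_fire(dom_A, D_A, F_A),
      dragons_breathe_fire(dom_B, D_B, F_B))                       # True True
```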

What’s interesting, though, is the other direction. This tells us that if our set of formulas is preserved under taking substructures, then it has to be equivalent (modulo T) to a set of universal formulas. Hence, an apparently purely semantic property of a set of formulas—staying true of individuals, even when you chuck away other stuff—turns out to strongly pin down the syntactic character of that set. This makes the situation rather reminiscent of Beth’s theorem, which also states that a semantic property (implicit definability) and a syntactic property (explicit definability) turn out to coincide, even though at first glance the syntactic property looks stronger. And in fact, the parallels to Beth’s theorem seem to go deeper than this, although I admit that here we’re getting to technical details I haven’t properly looked into yet: Hodges’s proof of Beth’s theorem merely notes that it follows “from Theorem 6.6.1 as [the Łoś-Tarski theorem] followed from Theorem 6.5.1”. This would also suggest that the Łoś-Tarski theorem, like Beth’s theorem, will fail to hold in higher-order contexts, which might temper some of the philosophical lessons that can be drawn from this; still, let’s put that to one side.

Alright. So we’ve learned that being (equivalent to) a set of universal formulas is, more or less, the same thing as being preserved under substructures. The reason this seems interesting to me is how it bears on the question of what characterises a law of nature. There is a long-standing tradition in philosophy of science of taking laws of nature to paradigmatically be universally quantified statements. More specifically, in fact, there’s a tradition of thinking that laws of nature have to be of the form \forall x (Fx \rightarrow Gx), but let’s just go with the weaker claim.

What other properties do laws of nature have to have? Again, there’s a traditional answer: the laws have to be necessary, in some appropriate sense of that word. The problem with this, of course, is that it’s not at all clear how the requirement that laws be necessary could ever be empirically checked, at least insofar as “necessity” is understood as a matter of holding at all possible worlds: we cannot, after all, pop across to some nearby possible worlds to check whether or not the laws hold there. Of course, one could derive necessity from the laws instead, by taking it that necessity (at least, the relevant kind of necessity—presumably, nomological necessity) is defined by the laws. Unfortunately, although it will now be analytic (and so empirically verifiable, if only vacuously) that the laws are necessary, this is at the price of making necessity into no kind of constraint on the laws at all.

A better answer, I think, is that the laws are required to be—in effect—preserved under the taking of substructures. Of course, in the context of scientific theories we don’t tend to talk about substructures; instead, we talk about subsystems. But the point stands: by and large, we expect that when the laws hold of a system, they will also hold of the subsystems which make it up. This is a point that a few people have made recently. For instance, Mike Hicks has a paper on an “epistemic role account” of Humeanism which argues that Humeans should require best systems to exhibit virtues connected to the possibility of experimentally testing theories; he then goes on to observe that

The requirement that the lawbook be supportable by observation or experiments, then, constrains our lawbook as follows: to perform experiments, we need laws which can be observed in isolated subsystems of the universe…[this] requires the laws to apply to subsystems of the universe as well as the universe as a whole.

Hicks, “Dynamic Humeanism”

I also take this to be an idea that David Wallace has been arguing for: I’m not sure if there’s a neat quote like this one, but it’s pretty clearly there in the papers on Fundamental and Emergent Geometry in Newtonian Physics and observability and symmetry. No doubt there are others too – this seems like the kind of thing that pops up in different places, rather than being discussed in a single debate.

(Incidentally, there are some subtleties about whether we should think of the laws as holding of arbitrary subsystems, or only of those subsystems which are reasonably isolated. I tend toward the former, since most laws will let you stick in background structure to represent whatever the subsystem’s environment is up to (this is an observation due to David Wallace) – but others, including Hicks and Schaffer, and (I think, though I’ve not read the paper yet!) Hicks and Demarest, seem to disagree.)

At any rate, the philosophy-of-science relevance of the Łoś-Tarski theorem is then to point us toward a potentially profound connection between these two conditions. At least insofar as the theorem holds, the only way for a law to be preserved under substructures is for it to be universal in form. This suggests a way to push back, then, against the lawlikeness of anything which doesn’t have such a universal form: e.g. the Past Hypothesis, or the assertion that there exists a quantum wavefunction with such-and-such properties (as some Bohmians, e.g. Goldstein, Zanghì, and Dürr, have proposed). Note that this wouldn’t mean arguing that existential claims are never a legitimate part of scientific theories—just that we shouldn’t characterise such claims as falling under the laws. Going back to the Łoś-Tarski theorem, bear in mind that there are no constraints on the “background” theory T, which can have existential quantification in it up to the nines.

So the picture that emerges is of a two-tiered conception of scientific law: a background theory which postulates existence claims (say, “there exist such-and-such spacetime structures, and particles with such-and-such properties”), overlaid with a set of universal claims (“all particles must move and interact, relative to one another and the spacetime structure, in such-and-such a way”). This mirrors Tim Maudlin’s picture of theories as giving first an “ontology” and then a “nomology”; what is neat, though, is that (if you’ll grant that nomology has to be preserved under substructures) the Łoś-Tarski theorem means that these two aspects of scientific theories will be intimately bound up with the difference between the existential and the universal modes of quantification.

When is being wrong evidence that you’re right?

In 2016, FiveThirtyEight predicted that Hillary Clinton would win the US presidential election. This has made a lot of people very angry and been widely regarded as a bad move.

Now (well, a couple of weeks ago, when I started writing this post), FiveThirtyEight published their first prediction for the 2020 election, and in an exciting [read: ominous] twist, the calculated odds (a ~70% chance of a Democratic (Biden) victory vs a ~30% chance of a Republican (Trump) victory) pretty much coincided with the last prediction for the 2016 election. This has given the reception a certain air of déjà vu:

Given “what then happened”, people are feeling…a little skeptical.

Of course, there are plenty of others saying the opposite (this is, after all, Twitter), and making the observation that the 2016 prediction is perfectly consistent with what actually happened (since 30% ≠ 0%). This is true, of course. But we might worry (as the second tweet above suggests) that this is a royal road to unfalsifiability: any prediction which assigns non-zero probabilities to every outcome is going to be consistent with all possible evidence, so this isn’t a very demanding standard. And certainly, the intuition being expressed by the skeptics is reasonably compelling. If your model predicts that A is more likely than B, and B then happens, surely that’s something that should make you reduce your faith in the model?

But a quick thought experiment suggests that this intuition can’t hold universally. Let’s suppose first that, in fact, the final prediction for 2020 is the same as for 2016 (that is, that the prediction doesn’t change between now and November): a 70% chance of a Biden victory. And let’s imagine further that Biden does indeed win. On the intuition above, this is a good thing for the model, and should be taken as confirmation of it.

Now let’s imagine that the same thing happens in 2024: a 70/30 prediction, followed by a victory for the candidate who was given the 70% chance of winning. And then the same thing happens in 2028. And in 2032, 2036, 2040…in fact, let’s suppose that 100 elections later, as the horrifying biomechanical monstrosities of MechaTrump and CyberBiden face off against one another in the 2416 election, Nate Silver’s great-great-great-great-great-great-great-great-great-great-great-great-grandson unveils the prediction that CyberBiden has a 70% chance of beating MechaTrump – and, indeed, that this comes to pass. The above intuition suggests that each of these 100 events is confirmatory of the model, since in each case the model “got it right”. So when 2420 rolls around, our confidence in the model should (by the intuition) be extremely high.

But this doesn’t seem right. If the model is consistently predicting a 30% chance for the less favoured candidate, and yet the less favoured candidate has never once won, then that suggests the model is badly overestimating how close these elections are. Of course, it’s consistent with the model that the 30% candidate never wins – just as it’s consistent to suppose that a die is fair, even if 100 rolls fail to ever deliver a number lower than 3. But as noted above, consistency is a very weak standard in the context of probabilistic predictions; rolling 3 or more 100 times in a row is evidence against the fairness of the die, and an unbroken run of events that were each given “only” a 70% probability is evidence against the correctness of the model.

(Another way to put this is to link it to betting behaviour. Suppose that the New York Times model has – as is its wont – consistently assigned a 90% chance to the candidate that FiveThirtyEight assigns 70% to, and a 10% chance to the other. Then someone who was letting that model guide their betting behaviour would have done much better, over the course of the 400 years, than someone following the FiveThirtyEight model. For example, the NYT-follower would be consistently willing to offer the FiveThirtyEight-follower a bet that costs $1, and pays $4 if the less favoured candidate wins; and the FiveThirtyEight-follower will be consistently willing to accept such a bet. (Since the FiveThirtyEight-follower considers this bet to have a value of 4×0.3 = $1.20, so worth the dollar, whilst the NYT-follower considers it to have a value of 4×0.1 = $0.40.) If they make such a bet every election, the NYT-follower ends up making a tidy profit. Again, this doesn’t show conclusively that the FiveThirtyEight-follower wasn’t just very unlucky; but it’s evidence for their credences being inapt.)
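
Just to put numbers on that betting story, here it is as a few lines of Python (the $1 stake and $4 payout are the ones from the example above; the variable names are mine):

```python
# Expected value of the bet (costs $1, pays $4 if the less favoured candidate
# wins) according to each model, plus the realised profits over 100 elections
# in which the less favoured candidate never actually wins.

stake, payout = 1.0, 4.0
p_less_favoured_538 = 0.3   # FiveThirtyEight's probability for the less favoured candidate
p_less_favoured_nyt = 0.1   # the NYT model's probability for the same candidate

print(payout * p_less_favoured_538)   # 1.2 > 1, so the FiveThirtyEight-follower accepts the bet
print(payout * p_less_favoured_nyt)   # 0.4 < 1, so the NYT-follower is happy to offer it

n_elections = 100
print(n_elections * stake)    # the NYT-follower's profit: the stake, pocketed every time
print(-n_elections * stake)   # the FiveThirtyEight-follower's loss: the stake, lost every time
```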

And in fact, we can make this kind of intuition precise, by throwing a (pretty simplistic) Bayesian analysis at it. Just as a reminder, the Bayesian thinks that if your initial degrees of belief over some collection of propositions form a probability distribution P, then after receiving evidence E, your degrees of belief should now form a probability distribution P' such that for any proposition H, P'(H) = P(H|E) – that is, the probability (according to P) of H given E. By Bayes’ theorem, this is

P(H|E) = \frac{P(H) P(E|H)}{P(E)}

(Intuitively: your belief in the hypothesis H after receiving evidence E will be higher insofar as you already thought H was true, and insofar as (you thought) H made E likely to be true; and it will be lower to the extent that E was likely to be true independently of H.)

For example, this is going to say that if in 2016 we had some initial credence (i.e. degree of belief) C that the FiveThirtyEight model was correct, and credence T that Trump would win, then following the election (given that the probability of a Trump victory, assuming the correctness of the FiveThirtyEight model, was 0.3) we should have credence C' that the model was correct, where

C' = \frac{C \times 0.3}{T}

Incidentally, note that whether C' > C or C' < C (or C' = C) depends, in part, on T – that is, on how likely you took a Trump victory to be, all things considered: C' < C if and only if T > 0.3. So if someone had a 50-50 credence between Trump and Clinton, then following the election their credence in the FiveThirtyEight model would be down to 60% of whatever credence in it they had previously; but if they had only a 0.2 credence in a Trump victory, then their post-election credence in FiveThirtyEight would be one and a half times what it was before.
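
Here is that update spelled out in a couple of lines of Python, just to make the dependence on T vivid (the prior credence of 0.4 is an arbitrary illustrative value, not anything from the discussion above):

```python
def updated_credence(C, T, likelihood=0.3):
    """Posterior credence in the FiveThirtyEight model after a Trump win:
    prior credence C, all-things-considered credence T in a Trump victory,
    and the model's own probability (0.3) for that outcome."""
    return C * likelihood / T

print(round(updated_credence(C=0.4, T=0.5), 2))   # 0.24: down to 60% of the prior
print(round(updated_credence(C=0.4, T=0.2), 2))   # 0.6: one and a half times the prior
```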

This might already seem a bit weird: shouldn’t the 2016 result just be evidence against the FiveThirtyEight model, independently of what else someone thought? How could someone end up taking that result as evidence in favour of the model? To see how that might happen, suppose that before the election I entirely divide my credences between the FiveThirtyEight and NYT models: that is, I give a 0.5 credence to each of them. Then my credence T in a Trump victory comes out as the average of the two models, i.e. at 0.2. Following the election, my updated credences are 0.25 for the NYT model and 0.75 for the FiveThirtyEight model – which makes sense, given that although the FiveThirtyEight model was wrong, at least it was less wrong than the NYT model. More generally, if I had been more sceptical about a Trump victory than FiveThirtyEight was, then the result is evidence in favour of FiveThirtyEight in at least the minimal sense that it did better than me.

This example also points to something that we’ll need to be a bit careful about. For the sake of consistency, it’ll be best to imagine that we only have credences over models, which generate probabilistic predictions over election outcomes, rather than having credences over outcomes directly. For example, it wouldn’t be consistent to have a 0.8 credence in the FiveThirtyEight model but only a 0.2 credence in a Trump victory: if I have 0.8 faith in the FiveThirtyEight model, and that model says there’s a 0.3 chance of a Trump victory, then I should have at least a 0.8 * 0.3 = 0.24 credence of a Trump victory.

We’re now in a position to play out the thought experiment above in precise terms. Let’s imagine nine models M_i, where the model M_i (always) assigns an i/10 chance to a Democrat victory: so the model M_7 is the FiveThirtyEight model, and the model M_9 is the NYT model. Let’s also suppose that I start with a uniformly distributed credence over these models, so I assign each of them a 1/9 credence; as a result, my initial credence in a Democratic victory is 0.5 (the average prediction over these models). And now let’s imagine that at each time step, the Democrats win: how does my credence distribution over the models change?

The answer is plotted out in this chart – I’ve only included the first 20 time steps (i.e. the first 80 years), since that shows most of the behaviour we’re interested in. P is the (time-evolving) credence function: so the blue line at the top, for instance, is my credence in the model M_9.

Comparing the various models, we can see that my credence in any model which assigns the Democrats a chance of 0.5 or less decays away pretty rapidly. M_9, as the model which is consistently “closest” to the outcome, continuously cannibalises credence from the others, although the growth rate slows down once it’s cemented its lead. But it’s the ones in between that are of interest to us, since they capture the observation that we started with at the beginning of this post: that being right all the time, for a probabilistic model, isn’t necessarily good news. As we can see, those models initially gain in credence as the Democratic victories come in – but as things go on and we see more victories than those models expect, the credence in them starts to die away.
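
For anyone who wants to reproduce those trajectories, here is a minimal Python sketch of the calculation (plotting omitted; the function and variable names are just my own labels, not anything behind the original chart):

```python
# Nine models M_1, ..., M_9, where M_i always gives the Democrats an i/10
# chance of victory; a uniform prior over the models; and Bayesian updating
# on a run of Democratic wins.

dem_probs = [i / 10 for i in range(1, 10)]   # M_1 .. M_9
credences = [1 / 9] * 9                      # uniform prior

def update(credences, dem_win):
    """Condition the credences over models on a single election result."""
    likelihoods = [p if dem_win else 1 - p for p in dem_probs]
    unnormalised = [c * l for c, l in zip(credences, likelihoods)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]

history = [credences]
for _ in range(20):                          # twenty Democratic victories in a row
    credences = update(credences, dem_win=True)
    history.append(credences)

# Credence in M_7 (the FiveThirtyEight-style model) first rises, then decays:
print([round(h[6], 3) for h in history])

# Passing dem_win=False instead handles the surprise Republican win discussed below.
```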

Just for fun, let’s finish up by supposing that on the 21st time-step, the Democrats’ hegemony is broken by a surprise Republican victory. That means the credence-functions look like this:

Unsurprisingly, M_9 gets dethroned a bit – and the other models all get a boost in return. So these would be circumstances in which a model like M_8 (which, remember, says that there’s an 80% chance of a Democratic victory) would benefit more from a Republican victory than a Democratic one. So sometimes, evidence that a model considers somewhat (but not totally) unlikely is better evidence for that model than more of the evidence that it considers somewhat (but not totally) likely.

Changes of variables

I have a confession to make: I am confused about changes of variables.

Now, don’t get me wrong. I know how to change variables in a given problem. But (as is often the trigger for me realising I don’t have as good a grasp on something as I’d like) I’m trying to write up teaching notes on changes of variables, in a precise and principled way, and keep finding myself bogged down. So this post will be an effort to, as it were, think aloud through the issues that come up.

Let’s start with a nice easy example. Suppose that we’re thinking about the equation of the unit circle, in x-, y-coordinates:

x^2 + y^2 = 1

And now suppose that we want to convert to two-dimensional polar coordinates (r, \theta):

r = \sqrt{x^2 + y^2}

\theta = \tan^{-1}(\frac{y}{x})

Now, if we want to know the equation of the unit circle in these new coordinates, we substitute for x and y in the above equation, using the expressions inverse to those above:

x = r \cos \theta

y = r \sin \theta

Which gets us the expected result, r = 1. In this case, the use of the inverse expressions maybe doesn’t seem totally necessary since the original expression already only depended on x and y via x^2 + y^2; but it should hopefully be clear that that’s a peculiarity of this simple example.
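
If you want to see the substitution carried out mechanically, here is a quick check using sympy (just my tool of choice here; the positivity assumption on r does the work of remembering that r \geq 0):

```python
import sympy as sp

x, y = sp.symbols('x y', real=True)
r, theta = sp.symbols('r theta', positive=True)

# The left-hand side of the circle equation, with the inverse expressions
# x = r*cos(theta) and y = r*sin(theta) substituted in:
lhs = (x**2 + y**2).subs({x: r * sp.cos(theta), y: r * sp.sin(theta)})

print(sp.simplify(lhs))   # r**2, so the equation becomes r**2 = 1, i.e. r = 1 (since r >= 0)
```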

Now, I want to claim that even this relatively simple case picks up on a few potentially confusing things. For starters, we might be a little perturbed by the fact that the thing we actually use to convert our first equation (in terms of x and y) into our second (in terms of r and \theta) isn’t the definition of the new coordinates in terms of the old, but the inverse expressions that define the old coordinates in terms of the new. More than that, what – when you get down to it – is an expression like x = r \cos \theta supposed to mean? In particular, is this the defining equation for a real-valued function of r and \theta, or a recipe licensing the replacement of the simple expression “x” by the complex expression “r \cos \theta”? Or, if it’s both at once, then how do these two roles intertwine?

Let’s take a bit of a step back, then. We started with an equation, which may be regarded as a certain kind of syntactic object, formulated in a particular mathematical language: one featuring, in particular, the variables x and y, in addition to the usual algebraic symbols (+, etc.). We’ll think of this as analogous to a formula formed in a first-order or propositional language.

On the semantic side, this equation is to be interpreted over \mathbb{R}^2. We’re helping ourselves to the idea that this structure has been parameterised by x and y; that is to say, an element of \mathbb{R}^2 is really a map p from \{\ulcorner x \urcorner, \ulcorner y \urcorner\} to \mathbb{R}; I’m using corner-quotes here to stress that it’s the symbols that are being mapped to points of \mathbb{R}. (This is supposed to resemble the manner in which an interpretation of a logical formula is a map from the set of propositional or predicate symbols to truth-values or sets of tuples.) Such a point satisfies the equation if p(\ulcorner x \urcorner)^2 + p(\ulcorner y \urcorner)^2 = 1. A solution to this equation is a set of points in \mathbb{R}^2, such that each point in the set satisfies this equation.

All this might seem like a massive palaver to go through, just in order to state the elementary fact that a solution to the equation x^2 + y^2 = 1 is a set of points (x, y) for which the equation is true. And of course, that’s certainly true; but the hope is that by being as careful as possible, and in particular by distinguishing starkly between syntactic and semantic considerations, we can figure out what’s going on when we start working with new variables.
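
Just to make the points-as-maps picture concrete, here is the same idea rendered as a few lines of Python (the dictionary representation, and the use of a floating-point tolerance, are my own choices for the illustration):

```python
import math

# A "point" of R^2, parameterised by x and y: literally a map from the
# variable symbols to real numbers.
p = {'x': 0.6, 'y': 0.8}

def satisfies_circle(point):
    """Does the point satisfy  x^2 + y^2 = 1  (up to floating-point tolerance)?"""
    return math.isclose(point['x'] ** 2 + point['y'] ** 2, 1.0)

print(satisfies_circle(p))                      # True: (0.6, 0.8) lies on the unit circle
print(satisfies_circle({'x': 1.0, 'y': 1.0}))   # False
```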

(Incidentally, as a side-note to the above: if we do want to think of the equation as analogous to a first-order formula like \forall x Px, there’s an interesting question about what the proper analogue of a first-order model of that formula ought to be. Is it an individual point in \mathbb{R}^2 satisfying the equation, or a solution to the equation?)

On the other side of the divide, we have an equation in a different language – namely, one having the same algebraic and operation symbols, etc., but having instead the variables r and \theta. This equation just reads r = 1. (Let’s forget, for the moment, that we obtained this equation from the previous one through some substitution work.) On the semantic side, this language is also interpreted over points of \mathbb{R}^2, as parameterised by r and \theta: that is, it is interpreted by maps from \{\ulcorner r \urcorner, \ulcorner \theta \urcorner\} to \mathbb{R}. We also include the specifications (i) that we’ll only consider maps p such that p(\ulcorner r \urcorner) \geq 0, and (ii) that \theta is periodic, so that if p(\ulcorner r \urcorner) = q(\ulcorner r \urcorner) and p(\ulcorner \theta \urcorner) = q(\ulcorner \theta \urcorner) + 2\pi, then p = q. (Which means that we’re not really interpreting it over \mathbb{R}^2 after all, but rather over the half-cylinder obtained by gluing together the edges \theta = 0 and \theta = 2\pi of the strip \{(r, \theta) : r \geq 0, 0 \leq \theta \leq 2\pi\} \subset \mathbb{R}^2.) So, analogously to what we had before, a point p satisfies the equation if p(\ulcorner r \urcorner) = 1.

Now, there is meant to be some sense in which these two equations “have the same content”. What sense is that? Well, here are two relevant observations. First, at the syntactic level: uniformly substituting \ulcorner r \cos \theta \urcorner for \ulcorner x \urcorner and \ulcorner r \sin \theta \urcorner for \ulcorner y \urcorner in the former equation yields an equation provably equivalent to the latter, and uniformly substituting \ulcorner \sqrt{x^2 + y^2} \urcorner for \ulcorner r \urcorner (and \ulcorner\tan^{-1}(y/x)\urcorner for \ulcorner \theta \urcorner) in the latter equation yields an equation provably equivalent to the former. Moreover, applying these two substitutions to \ulcorner x \urcorner in turn yields an expression (namely, \ulcorner \sqrt{x^2 + y^2} \cos(\tan^{-1}(y/x)) \urcorner) that is provably equivalent to \ulcorner x \urcorner; the same goes for all the other variables. In other words, we’ve got a pair of mutually inverse “translations” between the two equations.
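
That roundtrip claim is easy to check with sympy (again, just my choice of tool; I restrict to positive x and y so that the square root and the inverse tangent behave themselves):

```python
import sympy as sp

x, y = sp.symbols('x y', positive=True)

# Apply the two substitutions to x in turn: first x -> r*cos(theta), then
# r -> sqrt(x**2 + y**2) and theta -> atan(y/x).
roundtrip_x = sp.sqrt(x**2 + y**2) * sp.cos(sp.atan(y / x))

print(sp.simplify(roundtrip_x))         # x: the composite expression is equivalent to x
print(roundtrip_x.subs({x: 3, y: 4}))   # 3, as a numerical spot check
```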

At the semantic level, suppose I fix the following map F from \mathbb{R}^2, parameterised by x and y, to the half-cylinder constructed from \mathbb{R}^2 and parameterised by r and \theta: given any p: \{\ulcorner x \urcorner, \ulcorner y \urcorner\} \to \mathbb{R}, F(p) : \{\ulcorner r \urcorner, \ulcorner \theta \urcorner\} \to \mathbb{R} according to

F(p) : \ulcorner r \urcorner \mapsto \sqrt{ p(\ulcorner x \urcorner)^2 + p(\ulcorner y \urcorner)^2}

F(p) : \ulcorner \theta \urcorner \mapsto \tan^{-1}(p(\ulcorner y \urcorner)/ p(\ulcorner x \urcorner))

This map then has the following feature: given any point p \in \mathbb{R}^2, if p satisfies the first equation then F(p) satisfies the second equation. Similarly, we can construct a map G from the half-cylinder to \mathbb{R}^2, which will have the feature that for any point q of the former, if q satisfies the second equation then G(q) satisfies the first equation. And moreover, a little work will show that F and G are inverse to one another (well, other than places like r = 0).
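
Here, finally, is a rough Python rendering of F and G acting on points-as-dictionaries, together with a check that satisfaction gets carried across. It is only a sketch: I ignore the r = 0 issue just mentioned, as well as points with x \leq 0 (where \tan^{-1}(y/x) needs more careful handling).

```python
import math

def F(p):
    """From an (x, y)-point to an (r, theta)-point, following r = sqrt(x^2 + y^2)
    and theta = arctan(y/x)."""
    return {'r': math.hypot(p['x'], p['y']),
            'theta': math.atan(p['y'] / p['x'])}

def G(q):
    """From an (r, theta)-point to an (x, y)-point, following x = r cos(theta)
    and y = r sin(theta)."""
    return {'x': q['r'] * math.cos(q['theta']),
            'y': q['r'] * math.sin(q['theta'])}

def satisfies_first(p):    # the equation  x^2 + y^2 = 1
    return math.isclose(p['x'] ** 2 + p['y'] ** 2, 1.0)

def satisfies_second(q):   # the equation  r = 1
    return math.isclose(q['r'], 1.0)

p = {'x': 0.6, 'y': 0.8}
print(satisfies_first(p), satisfies_second(F(p)))   # True True: satisfaction is carried across by F

q = G(F(p))                                         # and G undoes F (away from the bad points)
print(math.isclose(q['x'], p['x']), math.isclose(q['y'], p['y']))   # True True
```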

The reason this is interesting is that it mirrors what happens with translations in formal languages: there too, one has both syntactic translations between expressions and semantic maps between interpretational structures. Moreover, it draws out one of the things that makes changes of variables so confusing (at least to me!): I started out by saying that an expression like x = r \cos \theta seems to be both a specification of a function, and a substitution manual. We’ve now got a more precise grip on what is meant by this – these two identities correspond to the semantic and syntactic maps just described. But note that (as, again, happens with formal languages) the two maps associated to an expression like x = r \cos \theta “go in opposite directions”, i.e. are dual concepts. The syntactic side of the coin is a map from expressions in the (x, y)-language to expressions in the (r, \theta)-language; the semantic side of the coin, however, is a map from points of the (r, \theta)-structure (the half-cylinder) to points of the (x, y)-structure. (That is, the semantic map is half of the instructions needed to construct G above; if I’d thought this through more carefully I’d have explicitly constructed G rather than F, but the point stands – in the construction of F we used the expressions r = \sqrt{x^2 + y^2} and \theta = \tan^{-1}(y/x).)

In any event, all of this was supposed to just be a warmup to the really tricky case – doing all this for differential equations, not just algebraic equations. Still, writing this has helped me unpick some of the tangle (admittedly, which might well be a tangle I have entirely created for myself); extending this to differential equations will have to wait for another day.