To take the climate example, say scientists had figured out that there was a biological feedback that kicks in once global warming has gone past 2°C (e.g. bacteria become more efficient at decomposing soil and releasing CO2). Suppose you have one model that includes a representation of that feedback (e.g. as a subprocess) and one that does not but is equivalent in every other way (e.g. is coded like the first model but lacks the subprocess). Then isn’t the second model simpler according to metrics like the minimum description length, so that it would be weighted higher if we penalised models using such metrics? But this seems the wrong thing to do, if we think the first model is more likely to give a good prediction.
The trick here is that the data on which the model is trained/fit has to include whatever data the scientists used to learn about that feedback loop in the first place. As long as that data is included, the model which accounts for it will have lower minimum description length. (This fits in with a general theme: the minimum-complexity model is simple and general-purpose; the details are learned from the data.)
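A toy sketch of that point, in case it helps. Everything here is made up for illustration: the "climate" is a straight line with a kink past a hypothetical threshold, and a crude BIC-style two-part code (parameter bits plus residual bits) stands in for minimum description length, which is an assumption and not the real thing. When the fitted data includes the post-threshold observations, the model with the feedback term comes out shorter; when it doesn't, the extra term only costs bits.

```python
import math
import random

THRESHOLD = 2.0  # hypothetical warming level at which the feedback switches on

def fit_rss(xs, ys, use_feedback):
    """Least-squares fit of y = a*x + b*max(0, x - THRESHOLD).

    Returns the residual sum of squares. Without the feedback term, or when
    no data lies past the threshold, b is fixed at 0.
    """
    f1 = xs
    f2 = [max(0.0, x - THRESHOLD) if use_feedback else 0.0 for x in xs]
    s11 = sum(u * u for u in f1)
    s12 = sum(u * v for u, v in zip(f1, f2))
    s22 = sum(v * v for v in f2)
    t1 = sum(u * y for u, y in zip(f1, ys))
    t2 = sum(v * y for v, y in zip(f2, ys))
    det = s11 * s22 - s12 * s12
    if det > 1e-12:
        # Solve the 2x2 normal equations for (a, b).
        a = (t1 * s22 - t2 * s12) / det
        b = (t2 * s11 - t1 * s12) / det
    else:
        a, b = t1 / s11, 0.0
    return sum((y - a * u - b * v) ** 2 for u, v, y in zip(f1, f2, ys))

def description_length(xs, ys, use_feedback):
    """Crude two-part code length in bits (BIC-style stand-in for MDL)."""
    n = len(xs)
    k = 2 if use_feedback else 1
    model_bits = 0.5 * k * math.log2(n)
    data_bits = 0.5 * n * math.log2(fit_rss(xs, ys, use_feedback) / n)
    return model_bits + data_bits

rng = random.Random(0)
xs_full = [0.1 * i for i in range(1, 41)]
ys_full = [x + 3.0 * max(0.0, x - THRESHOLD) + rng.gauss(0.0, 0.1) for x in xs_full]

# "Early" data: only observations from before the feedback kicks in.
early = [(x, y) for x, y in zip(xs_full, ys_full) if x < 1.95]
xs_early = [x for x, _ in early]
ys_early = [y for _, y in early]

dl_full_plain = description_length(xs_full, ys_full, use_feedback=False)
dl_full_feedback = description_length(xs_full, ys_full, use_feedback=True)
dl_early_plain = description_length(xs_early, ys_early, use_feedback=False)
dl_early_feedback = description_length(xs_early, ys_early, use_feedback=True)

print("full data :", dl_full_plain, dl_full_feedback)
print("early data:", dl_early_plain, dl_early_feedback)
```

The absolute numbers mean nothing (the residual term can even go negative); only the comparison between the two models on the same data matters.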
Now the thought that occurred to me when writing that is that the data the scientists used to deduce the existence of the feedback ought to be accounted for by the models that are used, and this would give low posterior weight to models that don’t include the feedback. But doing this in practice seems hard.
… I’m responding as I read. Yup, exactly. As the Bayesians say, we do need to account for all our prior information if we want reliably good results. In practice, this is “hard” in the sense of “it requires significantly more complicated programming”, but not in the sense of “it increases the asymptotic computational complexity”. The programming is more complicated mainly because the code needs to accept several qualitatively different kinds of data, and custom code is likely needed for hooking up each of them. But that’s not a fundamental barrier; it’s still the same computational challenges which make the approach impractical.
it’s not clear to me if there would be a way to tell between models that represent the process but don’t connect it properly to predicting the climate...
Again, we need to include whatever data allowed scientists to connect it to the climate in the first place. (In some cases this is just fundamental physics, in which case it’s already in the model.)
If our models were deterministic, then if they were not true, wouldn’t it be impossible for them to produce the observed data exactly, so that the likelihood of the data given any of those models would be zero? (Unless there was more than one process that could give rise to the same data, which seems unlikely in practice.)
Picture a deterministic model which uses fundamental physics, and models the joint distribution of position and momentum of every atom comprising the Earth. The unknown in this model is the initial conditions—the initial position and momentum of every particle (also particle identity, i.e. which element/isotope each is, but we’ll ignore that). Now, imagine how many of the possible initial conditions are compatible with any particular high-level data we observe. It’s a massive number!
Point is: the deterministic part of a fundamental physical model is the dynamics; the initial conditions are still generally unknown. Conceptually, when we fit the data, we’re mostly looking for initial conditions which match. So zero likelihoods aren’t really an issue; the issue is computing with a joint distribution over position and momentum of so many particles. That’s what statistical mechanics is for.
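Here is a small sketch of why the likelihood isn't zero. The setup is entirely hypothetical: ten free particles on a ring with perfectly deterministic dynamics, and the only "data" is a coarse count of how many ended up in the left half. Sampling unknown initial conditions and counting how many reproduce the observation gives a Monte Carlo estimate of the likelihood, and a sizeable fraction of initial conditions match.

```python
import random

def simulate(positions, velocities, steps, dt=0.1):
    """Deterministic dynamics: free particles on a ring of circumference 1."""
    return [(x + v * steps * dt) % 1.0 for x, v in zip(positions, velocities)]

def coarse_observation(positions):
    """High-level data: how many particles sit in the left half of the ring."""
    return sum(1 for x in positions if x < 0.5)

def estimate_likelihood(observed, n_particles, steps, n_samples, rng):
    """Fraction of sampled initial conditions compatible with the observation."""
    hits = 0
    for _ in range(n_samples):
        positions = [rng.random() for _ in range(n_particles)]
        velocities = [rng.uniform(-1.0, 1.0) for _ in range(n_particles)]
        if coarse_observation(simulate(positions, velocities, steps)) == observed:
            hits += 1
    return hits / n_samples

rng = random.Random(0)
likelihood = estimate_likelihood(
    observed=5, n_particles=10, steps=20, n_samples=2000, rng=rng
)
print("estimated P(observation | model):", likelihood)
```

The dynamics have no randomness at all; the probability comes purely from not knowing the initial positions and velocities.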
whilst statistical averaging has got human modellers a certain distance, adding representations of processes whose effects get missed by the averaging seems to add a lot of value
The corresponding problem in statistical mechanics is to identify the “state variables”—the low-level variables whose averages correspond to macroscopic observables. For instance, the ideal gas law uses density, kinetic energy, and force on container surfaces (whose macroscopic averages correspond to density, temperature, and pressure). Fluid flow, rather than averaging over the whole system, uses density and particle velocity within each little cell of space.
The point: if an effect is “missed by averaging”, that’s usually not inherent to averaging as a technique. The problem is that people average over poorly-chosen features.
Jaynes argued that the key to choosing high-level features is reproducibility: what high-level variables do experimenters need to control in order to get a consistent result distribution? If we consistently get the same results without holding X constant (where X includes e.g. initial conditions of every particle), then apparently X isn’t actually relevant to the result, so we can average out X. Also note that there are some degrees of freedom in what “results” we’re interested in. For instance, turbulence has macroscopic behavior which depends on low-level initial conditions, but the long-term time average of forces from a turbulent flow usually doesn’t depend on low-level initial conditions, and for engineering purposes, it’s often that time average which we actually care about.
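A cheap stand-in for the turbulence example, as a sketch only: the logistic map with r = 3.9 (my choice of toy chaotic system, nothing to do with real fluid flow). Two runs whose initial conditions differ by one part in a billion end up in unrelated states, yet their long-run time averages agree closely, so the time average is a "result" that is reproducible without controlling the initial condition.

```python
def logistic_orbit(x0, steps, r=3.9):
    """Iterate x -> r*x*(1-x); return the final state and the time average."""
    x = x0
    total = 0.0
    for _ in range(steps):
        x = r * x * (1.0 - x)
        total += x
    return x, total / steps

# Two initial conditions that differ far below any realistic measurement precision.
end_a, avg_a = logistic_orbit(0.2, 20000)
end_b, avg_b = logistic_orbit(0.2 + 1e-9, 20000)

print("final states :", end_a, end_b)   # sensitive to the initial condition
print("time averages:", avg_a, avg_b)   # nearly insensitive to it
```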
if the models we can afford to compute with can’t reproduce the data, then presumably they are also not reproducing the correct causal graph exactly? And any causal graph we could compute with will not be able to reproduce the data?
Once we move away from stat mech and approximations of low-level models, yes, this becomes a problem. However, two counterpoints. First, this is the sort of problem where the output says “well, the best model is one with like a gazillion edges, and there’s a bunch that all fit about equally well, so we have no idea what will happen going forward”. That’s unsatisfying, but at least it’s not wrong. Second, if we do get that sort of result, then it probably just isn’t possible to do better with the high-level variables chosen. Going back to reproducibility and selection of high-level variables: if we’ve omitted some high-level variable which really does impact the results we’re interested in, then “we have no idea what will happen going forward” really is the right answer.
I think I need to think more about the likelihood issue. I still feel like we might be thinking about different things—when you say “a deterministic model which uses fundamental physics”, this would not be in the set of models that we could afford to run to make predictions for complex systems. For the models we could afford to run, it seems to me that no choice of initial conditions would lead them to match the data we observe, except by extreme coincidence (analogous to a simple polynomial just happening to pass through all the datapoints produced by a much more complex function).
I’ve gone through Jaynes’ paper now from the link you gave. His point about deciding what macroscopic variables matter is well-made. But you still need a model of how the macroscopic variables you observe relate to the ones you want to predict. In modelling atmospheric processes, simple spatial averaging of the fluid dynamics equations over resolved spatial scales gets you some way, but then changing the form of the function relating the future to present states (“adding representations of processes” as I put it before) adds additional skill. And Jaynes’ paper doesn’t seem to say how you should choose this function.
For the models we could afford to run, it seems to me that no choice of initial conditions would lead them to match the data we observe, except by extreme coincidence (analogous to a simple polynomial just happening to pass through all the datapoints produced by a much more complex function).
Ok, let’s talk about computing with error bars, because it sounds like that’s what’s missing from what you’re picturing.
The usual starting point is linear error: we assume that errors are small enough for linear approximation to be valid. (After this we’ll talk about how to remove that assumption.) We have some multivariate function $f(x)$; imagine that $x$ is the full state of our simulation at some timestep, and $f$ calculates the state at the next timestep. The value $\bar{x}$ of $x$ in our program is really just an estimate of the “true” value $x$; it has some error $\Delta x = x - \bar{x}$. As a result, the value $\bar{f} = f(\bar{x})$ of $f$ in our program also has some error $\Delta f = f - \bar{f}$. Assuming the error is small enough for linear approximation to hold, we have:
$$\Delta f = f - \bar{f} = f(\bar{x} + \Delta x) - f(\bar{x}) \approx \left( \left. \frac{df}{dx} \right|_{\bar{x}} \right) \Delta x$$
where $\frac{df}{dx}$ is the Jacobian, i.e. the matrix of derivatives of every entry of $f(x)$ with respect to every entry of $x$ (row $i$, column $j$ holds $\partial f_i / \partial x_j$).
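Next, assume that $\Delta x$ has covariance matrix $\Sigma_x$, and we want to compute the covariance matrix $\Sigma_f$ of $\Delta f$. We have a linear relationship between $\Delta x$ and $\Delta f$, so we use the usual formula for linear transformation of covariance, which follows from $\Sigma_f = E[\Delta f \, \Delta f^T]$ with $\Delta f \approx \frac{df}{dx} \Delta x$:
$$\Sigma_f = \frac{df}{dx} \, \Sigma_x \, \left( \frac{df}{dx} \right)^T$$
Now imagine iterating this at every timestep: we compute the timestep itself, then differentiate that timestep, and matrix multiply our previous uncertainty on both sides by the derivative matrix to get the new uncertainty:
$$\bar{x}(t+1) = f(\bar{x}(t))$$
$$\Sigma_x(t+1) = \left( \left. \frac{df}{dx} \right|_{\bar{x}(t)} \right) \Sigma_x(t) \left( \left. \frac{df}{dx} \right|_{\bar{x}(t)} \right)^T$$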
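The iteration can be sketched in a few lines. The system below is my own toy choice (an Euler-stepped pendulum with state [angle, angular velocity]), and the Jacobian is written out by hand with entry [i][j] equal to the derivative of output i with respect to input j, so that the error propagates as Δf ≈ J Δx and the covariance as J Σ Jᵀ. A real simulation would get the derivative from an autodiff library instead.

```python
import math

def f(state, dt):
    """One explicit-Euler step of a frictionless pendulum (toy dynamics)."""
    theta, omega = state
    return [theta + dt * omega, omega - dt * math.sin(theta)]

def jacobian(state, dt):
    """Hand-written Jacobian of f: entry [i][j] is d f_i / d x_j."""
    theta, _ = state
    return [[1.0, dt], [-dt * math.cos(theta), 1.0]]

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def transpose(a):
    return [list(row) for row in zip(*a)]

def step(state, cov, dt):
    """Advance the point estimate and push the covariance through J cov J^T."""
    jac = jacobian(state, dt)  # linearise around the current estimate
    new_cov = matmul(matmul(jac, cov), transpose(jac))
    return f(state, dt), new_cov

state = [1.0, 0.0]
cov = [[0.01, 0.0], [0.0, 0.01]]
for _ in range(50):
    state, cov = step(state, cov, dt=0.1)

print("estimate  :", state)
print("covariance:", cov)
```

Note this only propagates existing uncertainty; it adds no new error at each step and stores the full matrix, which is fine for two variables and hopeless for a large simulation state.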
Now, a few key things to note:
For most systems of interest, that uncertainty is going to grow over time, usually exponentially. That’s correct: in a chaotic system, if the initial conditions are uncertain, then of course we should become more and more uncertain about the system’s state over time.
Those formulas only propagate uncertainty in previous state to uncertainty in the next state. Really, there’s also new uncertainty introduced at each timestep, e.g. from error in f itself (i.e. due to averaging) or from whatever’s driving the system. Typically, such errors are introduced as an additive term—i.e. we compute the covariance in x introduced by each source of error, and add them to the propagated covariance matrix at each timestep.
Actually storing the whole covariance matrix would take $O(n^2)$ space if x has n elements, which is completely impractical when x is the whole state of a finite element simulation. We make this practical the same way we make all matrix operations practical in numerical computing: exploit sparsity/structure. This is application-specific, but usually the covariance can be well-approximated as the sum of sparse “local” covariances and low-rank “global” covariances.
Likewise with the update: we don’t actually want to compute the n-by-n derivative matrix and then matrix-multiply with the covariance. Most backpropagation libraries expose the derivative as a linear operator rather than an explicit matrix, and we want to use it that way. Again, specifics will vary, depending on the structure of f and of the (approximated) covariance matrix.
In many applications, we have data coming in over time. That data reduces our uncertainty every time it comes in; at that point, we effectively have a Kalman filter. If enough data is available, the uncertainty remains small enough for the linear approximation to continue to hold, and the whole thing works great.
If the uncertainty does become too large for linear approximation, then we need to resort to other methods for representing uncertainty, rather than just a covariance matrix. Particle filters are one simple-but-effective fallback, and can be combined with linear uncertainty as well.
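To make a couple of the notes above concrete, here is a scalar sketch of the predict/update cycle. All the specifics are assumptions for illustration: the dynamics are a toy unstable linear map (so the linearisation happens to be exact), new error enters as an additive variance `Q` per step, and each measurement observes the state directly with noise variance `R`. Run once without data and once with data, it shows the variance blowing up exponentially in the first case and staying below the measurement variance in the second.

```python
import random

def predict(x, p, f, df, q):
    """Propagate the estimate and its variance one step; q is added process-noise variance."""
    slope = df(x)  # derivative of the dynamics at the current estimate
    return f(x), slope * p * slope + q

def update(x, p, z, r):
    """Fold in a direct noisy measurement z of the state, with noise variance r."""
    gain = p / (p + r)
    return x + gain * (z - x), (1.0 - gain) * p

def f(x):
    return 1.1 * x  # toy unstable dynamics (hypothetical)

def df(x):
    return 1.1

Q, R = 0.001, 0.04
rng = random.Random(1)
truth = 1.0 + rng.gauss(0.0, 0.1)

x_free, p_free = 1.0, 0.01  # never sees data
x_filt, p_filt = 1.0, 0.01  # gets a measurement every step

for _ in range(30):
    truth = f(truth) + rng.gauss(0.0, Q ** 0.5)
    z = truth + rng.gauss(0.0, R ** 0.5)
    x_free, p_free = predict(x_free, p_free, f, df, Q)
    x_filt, p_filt = predict(x_filt, p_filt, f, df, Q)
    x_filt, p_filt = update(x_filt, p_filt, z, R)

print("variance without data:", p_free)
print("variance with data   :", p_filt)
```

For a vector state the same two steps use the Jacobian and covariance matrices in place of `slope` and `p`, and the structure tricks for large states apply on top of that.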
In general, if this sounds interesting and you want to know more, it’s covered in a lot of different contexts. I first saw most of it in an autonomous vehicles course; besides robotics, it’s also heavily used in economic models, and sometimes systems/control theory courses will focus on this sort of stuff.
Is this starting to sound like a model for which the observed data would have nonzero probability?
Do you mean you’d be adding the probability distribution with that covariance matrix on top of the mean prediction from f, to make it a probabilistic prediction? I was talking about deterministic predictions before, though my text doesn’t make that clear. For probabilistic models, yes, adding an uncertainty distribution may result in non-zero likelihoods. But if we know the true dynamics are deterministic (pretend there are no quantum effects, which are largely irrelevant for our prediction errors for systems in the classical physics domain), then we still know the model is not true, and so it seems difficult to interpret p if we were to do Bayesian updating.
Likelihoods are also not obviously (to me) very good measures of model quality for chaotic systems, either. In these cases we know that even if we had the true model, its predictions would diverge from reality due to errors in the initial condition estimates, but it would trace out the correct attractor, and it’s the attractor geometry (conditional on boundary conditions) that we’d seem to really want to assess. Perhaps then it would have a higher likelihood than every other model, but it’s not obvious to me, and it’s not obvious that there’s not a better metric for leading to good inferences when we don’t have the true model.
Basically the logic that says to use Bayes for deducing the truth does not seem to carry over in an obvious way (to me) to the case when we want to predict but can’t use the true model.
But if we know the true dynamics are deterministic (pretend there are no quantum effects, which are largely irrelevant for our prediction errors for systems in the classical physics domain), then we still know the model is not true, and so it seems difficult to interpret p if we were to do Bayesian updating.
Ah, that’s where we need to apply more Bayes. The underlying system may be deterministic at the macroscopic level, but that does not mean we have perfect knowledge of all the things which affect the system’s trajectory. Most of the uncertainty in e.g. a weather model would not be quantum noise, it would be things like initial conditions, measurement noise (e.g. how close is this measurement to the actual average over this whole volume?), approximation errors (e.g. from discretization of the dynamics), driving conditions (are we accounting for small variations in sunlight or tidal forces?), etc. The true dynamics may be deterministic, but that doesn’t mean that our estimates of all the things which go into those dynamics have no uncertainty. If the inputs have uncertainty (which of course they do), then the outputs also have uncertainty.
The main point of probabilistic models is not to handle “random” behavior in the environment, it’s to quantify uncertainty resulting from our own (lack of) knowledge of the system’s inputs/parameters.
Likelihoods are also not obviously (to me) very good measures of model quality for chaotic systems, either—in these cases we know that even if we had the true model, its predictions would diverge from reality due to errors in the initial condition estimates, but it would trace out the correct attractor...
Yeah, you’re pointing to an important issue here, although it’s not actually likelihoods which are the problem—it’s point estimates. In particular, that makes linear approximations a potential issue, since they’re implicitly approximations around a point estimate. Something like a particle filter will do a much better job than a Kalman filter at tracing out an attractor, since it accounts for nonlinearity much better.
Anyway, reasoning with likelihoods and posterior distributions remains valid regardless of whether we’re using point estimates. When the system is chaotic but has an attractor, the posterior probability of the system state will end up smeared pretty evenly over the whole attractor. (Although with enough fine-grained data, we can keep track of roughly where on the attractor the system is at each time, which is why Kalman-type models work well in that case.)
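A rough sketch of both halves of that claim, using a bootstrap particle filter on the logistic map with r = 3.9. The map, the noise levels, and the small jitter added to particles are all my own toy choices, not anything from a real weather or climate model. One ensemble runs freely and smears out over the attractor; the other is reweighted and resampled against a noisy observation at every step and stays concentrated near the true state.

```python
import math
import random

R_MAP = 3.9        # logistic-map parameter (toy chaotic system)
OBS_SD = 0.05      # assumed measurement noise
JITTER_SD = 0.005  # small noise added to particles to keep them diverse

def logistic(x):
    return R_MAP * x * (1.0 - x)

def clamp(x):
    return min(1.0, max(0.0, x))

def propagate(particles, rng):
    """Push every particle through the dynamics, plus a little jitter."""
    return [clamp(logistic(p) + rng.gauss(0.0, JITTER_SD)) for p in particles]

def resample(particles, z, rng):
    """Weight particles by the Gaussian likelihood of observation z, then resample."""
    weights = [math.exp(-0.5 * ((z - p) / OBS_SD) ** 2) for p in particles]
    return rng.choices(particles, weights=weights, k=len(particles))

def spread(particles):
    """Standard deviation of the particle cloud."""
    mean = sum(particles) / len(particles)
    return math.sqrt(sum((p - mean) ** 2 for p in particles) / len(particles))

rng = random.Random(2)
truth = 0.3
filtered = [rng.random() for _ in range(500)]  # starts with no idea where the state is
free = [clamp(0.3 + rng.gauss(0.0, 0.001)) for _ in range(500)]  # starts very certain
errors = []

for _ in range(100):
    truth = logistic(truth)
    z = truth + rng.gauss(0.0, OBS_SD)
    free = propagate(free, rng)
    filtered = resample(propagate(filtered, rng), z, rng)
    errors.append(abs(sum(filtered) / len(filtered) - truth))

print("spread without data:", spread(free))
print("spread with data   :", spread(filtered))
print("mean tracking error, last 20 steps:", sum(errors[-20:]) / 20)
```

The free ensemble started out far more certain than the filtered one, and still ends up knowing only "somewhere on the attractor", which is the correct state of knowledge without data.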
Thanks again.