For the models we could afford to run, it seems to me that no choice of initial conditions would lead them to match the data we observe, except by extreme coincidence (analogous to a simple polynomial just happening to pass through all the datapoints produced by a much more complex function).
Ok, let’s talk about computing with error bars, because it sounds like that’s what’s missing from what you’re picturing.
The usual starting point is linear error: we assume that errors are small enough for linear approximation to be valid. (After this we’ll talk about how to remove that assumption.) We have some multivariate function $f(x)$; imagine that $x$ is the full state of our simulation at some timestep, and $f$ calculates the state at the next timestep. The value $\bar{x}$ of $x$ in our program is really just an estimate of the “true” value $x$; it has some error $\Delta x = x - \bar{x}$. As a result, the value $\bar{f}$ of $f$ in our program also has some error $\Delta f = f - \bar{f}$. Assuming the error is small enough for linear approximation to hold, we have:
$$\Delta f = f - \bar{f} = f(\bar{x} + \Delta x) - f(\bar{x}) \approx \left(\left.\frac{df}{dx}\right|_{\bar{x}}\right)\Delta x$$
where $\frac{df}{dx}$ is the Jacobian, i.e. the matrix of derivatives of every entry of $f(x)$ with respect to every entry of $x$.
Next, assume that $\Delta x$ has covariance matrix $\Sigma_x$, and we want to compute the covariance matrix $\Sigma_f$ of $\Delta f$. We have a linear relationship between $\Delta x$ and $\Delta f$, so we use the usual formula for linear transformation of covariance:
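$$\Sigma_f = \operatorname{Cov}(\Delta f) \approx \operatorname{Cov}\!\left(\frac{df}{dx}\,\Delta x\right) = \frac{df}{dx}\,\Sigma_x\left(\frac{df}{dx}\right)^T$$
Now imagine iterating this at every timestep: we compute the timestep itself, then differentiate that timestep, and multiply our previous uncertainty by the derivative matrix on the left and by its transpose on the right to get the new uncertainty:
$$\bar{x}(t+1) = f(\bar{x}(t))$$
$$\Sigma_x(t+1) = \left(\left.\frac{df}{dx}\right|_{\bar{x}(t)}\right)\Sigma_x(t)\left(\left.\frac{df}{dx}\right|_{\bar{x}(t)}\right)^T$$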
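To make the iteration concrete, here is a minimal Python sketch, assuming NumPy. It is not taken from any particular library: the two-dimensional Hénon map stands in for the simulation step f, and the names `f`, `jacobian`, and `propagate` are illustrative choices. It uses the convention that entry (i, j) of the Jacobian is the derivative of the i-th output with respect to the j-th input, so the covariance gets multiplied by the Jacobian on the left and by its transpose on the right.

```python
import numpy as np

A, B = 1.4, 0.3  # classic Henon map parameters (an illustrative choice)

def f(x):
    """One timestep of the Henon map, standing in for the simulation step."""
    return np.array([1.0 - A * x[0] ** 2 + x[1], B * x[0]])

def jacobian(x):
    """Jacobian of f at x: entry (i, j) is d f_i / d x_j."""
    return np.array([[-2.0 * A * x[0], 1.0],
                     [B, 0.0]])

def propagate(x_bar, sigma, n_steps):
    """Iterate the estimate and its linearized covariance for n_steps."""
    for _ in range(n_steps):
        J = jacobian(x_bar)      # differentiate the timestep at the current estimate
        sigma = J @ sigma @ J.T  # propagate the covariance through the linearization
        x_bar = f(x_bar)         # compute the timestep itself
    return x_bar, sigma

x0 = np.array([0.1, 0.1])
sigma0 = 1e-12 * np.eye(2)       # tiny initial uncertainty
x10, sigma10 = propagate(x0, sigma0, 10)
growth = np.trace(sigma10) / np.trace(sigma0)  # how much total variance grew
```

Here the Jacobian is written out by hand because the map is tiny; for a real simulation it would come from automatic differentiation or another structured approximation.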
Now, a few key things to note:
For most systems of interest, that uncertainty is going to grow over time, usually exponentially. That’s correct: in a chaotic system, if the initial conditions are uncertain, then of course we should become more and more uncertain about the system’s state over time.
Those formulas only propagate uncertainty in previous state to uncertainty in the next state. Really, there’s also new uncertainty introduced at each timestep, e.g. from error in f itself (i.e. due to averaging) or from whatever’s driving the system. Typically, such errors are introduced as an additive term—i.e. we compute the covariance in x introduced by each source of error, and add them to the propagated covariance matrix at each timestep.
Actually storing the whole covariance matrix would take $O(n^2)$ space if $x$ has $n$ elements, which is completely impractical when $x$ is the whole state of a finite element simulation. We make this practical the same way we make all matrix operations practical in numerical computing: exploit sparsity/structure. This is application-specific, but usually the covariance can be well-approximated as the sum of sparse “local” covariances and low-rank “global” covariances.
Likewise with the update: we don’t actually want to compute the n-by-n derivative matrix and then matrix-multiply with the covariance. Most backpropagation libraries expose the derivative as a linear operator rather than an explicit matrix, and we want to use it that way. Again, specifics will vary, depending on the structure of f and of the (approximated) covariance matrix.
In many applications, we have data coming in over time. That data reduces our uncertainty every time it comes in; at that point, we effectively have a Kalman filter. If enough data is available, the uncertainty remains small enough for the linear approximation to continue to hold, and the whole thing works great.
If the uncertainty does become too large for linear approximation, then we need to resort to other methods for representing uncertainty, rather than just a covariance matrix. Particle filters are one simple-but-effective fallback, and can be combined with linear uncertainty as well.
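As a rough illustration of two of the notes above (the additive error term at each timestep, and incoming data shrinking the uncertainty), here is a sketch of one predict/update cycle in the style of an extended Kalman filter, assuming NumPy. The Hénon map again stands in for f, and the names `predict`, `update`, `Q`, `H`, and `R` are my own labels: `Q` is the covariance added per timestep, `H` maps the state to what is measured, and `R` is the measurement noise covariance. The numbers are made up for the example.

```python
import numpy as np

A, B = 1.4, 0.3  # Henon map parameters (illustrative stand-in for the dynamics)

def f(x):
    return np.array([1.0 - A * x[0] ** 2 + x[1], B * x[0]])

def jacobian(x):
    # Entry (i, j) is d f_i / d x_j.
    return np.array([[-2.0 * A * x[0], 1.0],
                     [B, 0.0]])

def predict(x_bar, sigma, Q):
    """Propagate estimate and covariance one timestep, adding new error Q."""
    J = jacobian(x_bar)
    return f(x_bar), J @ sigma @ J.T + Q

def update(x_bar, sigma, y, H, R):
    """Fold in a measurement y = H x + noise, where the noise has covariance R."""
    S = H @ sigma @ H.T + R                # covariance of the predicted measurement
    K = np.linalg.solve(S, H @ sigma).T    # Kalman gain: sigma H^T S^{-1}
    x_new = x_bar + K @ (y - H @ x_bar)    # shift the estimate toward the data
    sigma_new = (np.eye(len(x_bar)) - K @ H) @ sigma  # uncertainty shrinks
    return x_new, sigma_new

x_bar, sigma = np.array([0.4, 0.2]), np.diag([0.04, 0.01])
Q = 1e-6 * np.eye(2)            # assumed per-step model error
H = np.array([[1.0, 0.0]])      # we only measure the first state component
R = np.array([[0.01]])          # assumed measurement noise variance

x_pred, sigma_pred = predict(x_bar, sigma, Q)
x_post, sigma_post = update(x_pred, sigma_pred, np.array([0.9]), H, R)
```

This dense version is only sensible for small states. For a large simulation, the same two steps would be applied to a sparse or low-rank representation of the covariance, with the derivative used as a linear operator.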
In general, if this sounds interesting and you want to know more, it’s covered in a lot of different contexts. I first saw most of it in an autonomous vehicles course; besides robotics, it’s also heavily used in economic models, and sometimes systems/control theory courses will focus on this sort of stuff.
Is this starting to sound like a model for which the observed data would have nonzero probability?
Do you mean you’d be adding the probability distribution with that covariance matrix on top of the mean prediction from f, to make it a probabilistic prediction? I was talking about deterministic predictions before, though my text doesn’t make that clear. For probabilistic models, yes, adding an uncertainty distribution may result in non-zero likelihoods. But if we know the true dynamics are deterministic (pretend there’s no quantum effects, which are largely irrelevant for our prediction errors for systems in the classical physics domain), then we still know the model is not true, and so it seems difficult to interpret p if we were to do Bayesian updating.
Likelihoods are also not obviously (to me) very good measures of model quality for chaotic systems, either. In these cases we know that even if we had the true model, its predictions would diverge from reality due to errors in the initial condition estimates, but it would trace out the correct attractor, and it’s the attractor geometry (conditional on boundary conditions) that we’d seem to really want to assess. Perhaps then it would have a higher likelihood than every other model, but it’s not obvious to me, and it’s not obvious that there’s not a better metric for leading to good inferences when we don’t have the true model.
Basically the logic that says to use Bayes for deducing the truth does not seem to carry over in an obvious way (to me) to the case when we want to predict but can’t use the true model.
But if we know the true dynamics are deterministic (pretend there’s no quantum effects, which are largely irrelevant for our prediction errors for systems in the classical physics domain), then we still know the model is not true, and so it seems difficult to interpret p if we were to do Bayesian updating.
Ah, that’s where we need to apply more Bayes. The underlying system may be deterministic at the macroscopic level, but that does not mean we have perfect knowledge of all the things which affect the system’s trajectory. Most of the uncertainty in e.g. a weather model would not be quantum noise, it would be things like initial conditions, measurement noise (e.g. how close is this measurement to the actual average over this whole volume?), approximation errors (e.g. from discretization of the dynamics), driving conditions (are we accounting for small variations in sunlight or tidal forces?), etc. The true dynamics may be deterministic, but that doesn’t mean that our estimates of all the things which go into those dynamics have no uncertainty. If the inputs have uncertainty (which of course they do), then the outputs also have uncertainty.
The main point of probabilistic models is not to handle “random” behavior in the environment, it’s to quantify uncertainty resulting from our own (lack of) knowledge of the system’s inputs/parameters.
Likelihoods are also not obviously (to me) very good measures of model quality for chaotic systems, either—in these cases we know that even if we had the true model, its predictions would diverge from reality due to errors in the initial condition estimates, but it would trace out the correct attractor...
Yeah, you’re pointing to an important issue here, although it’s not actually likelihoods which are the problem—it’s point estimates. In particular, that makes linear approximations a potential issue, since they’re implicitly approximations around a point estimate. Something like a particle filter will do a much better job than a Kalman filter at tracing out an attractor, since it accounts for nonlinearity much better.
Anyway, reasoning with likelihoods and posterior distributions remains valid regardless of whether we’re using point estimates. When the system is chaotic but has an attractor, the posterior probability of the system state will end up smeared pretty evenly over the whole attractor. (Although with enough fine-grained data, we can keep track of roughly where on the attractor the system is at each time, which is why Kalman-type models work well in that case.)
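For a feel of what the particle-filter version looks like, here is a minimal bootstrap particle filter sketch, assuming NumPy. Everything specific in it is an illustrative assumption: the logistic map as a one-dimensional chaotic system, Gaussian measurement noise, the particle count, and the small jitter added after resampling to keep the particles from collapsing onto a single value. Each particle goes through the full nonlinear dynamics, so nothing is linearized around a point estimate.

```python
import numpy as np

def step(x, r=3.9):
    """Logistic map: a one-dimensional chaotic stand-in for the simulation step."""
    return r * x * (1.0 - x)

def particle_filter(observations, n_particles=2000, obs_std=0.05,
                    jitter_std=1e-3, seed=0):
    """Return the posterior-mean state estimate after each observation."""
    rng = np.random.default_rng(seed)
    particles = rng.uniform(0.0, 1.0, size=n_particles)  # broad initial uncertainty
    estimates = []
    for y in observations:
        particles = step(particles)  # push every particle through the dynamics
        # Weight each particle by the Gaussian likelihood of the observation.
        log_w = -0.5 * ((y - particles) / obs_std) ** 2
        weights = np.exp(log_w - log_w.max())
        weights /= weights.sum()
        estimates.append(np.sum(weights * particles))
        # Resample in proportion to weight, then jitter to keep diversity.
        idx = rng.choice(n_particles, size=n_particles, p=weights)
        particles = particles[idx] + rng.normal(0.0, jitter_std, size=n_particles)
        particles = np.clip(particles, 0.0, 1.0)
    return np.array(estimates)

# Synthetic experiment: a "true" trajectory observed with noise at every step.
rng = np.random.default_rng(1)
x, truth = 0.3, []
for _ in range(50):
    x = step(x)
    truth.append(x)
truth = np.array(truth)
observations = truth + rng.normal(0.0, 0.05, size=truth.size)

estimates = particle_filter(observations)
mean_abs_error = np.mean(np.abs(estimates - truth))
```

With an observation at every step, the particles stay concentrated near the true state. Feed in observations less often, or with much larger noise, and the cloud of particles spreads back out over the attractor between updates.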