The way I’d phrase the theoretical problem is this: when you fit a model to a distribution (e.g. by minimizing KL-divergence on a set of samples), you can often prove theorems of the form “the fitted distribution has such-and-such relationship to the true distribution”; e.g. in linear regression you can compute confidence intervals for parameters and predictions.
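As a concrete toy instance of the kind of theorem I mean, here’s a quick simulation sketch (made-up data, ordinary least squares via numpy/scipy) checking that the classical 95% confidence interval for a regression slope covers the true slope at roughly the nominal rate:

```python
# Toy check: with a correctly-specified, flexible-enough model and enough
# samples, the classical CI for the slope covers the true value ~95% of the time.
# All parameter values here are made up for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_intercept, true_slope, noise_sd = 1.0, 2.0, 1.0
n, n_trials, covered = 200, 1000, 0

for _ in range(n_trials):
    x = rng.normal(size=n)
    y = true_intercept + true_slope * x + rng.normal(scale=noise_sd, size=n)

    # Ordinary least squares (the KL-minimizing fit under a Gaussian noise model).
    X = np.column_stack([np.ones(n), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - 2)                      # noise-variance estimate
    se_slope = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])

    # Classical 95% confidence interval for the slope.
    t_crit = stats.t.ppf(0.975, df=n - 2)
    lo, hi = beta[1] - t_crit * se_slope, beta[1] + t_crit * se_slope
    covered += (lo <= true_slope <= hi)

print("empirical coverage of the nominal 95% CI:", covered / n_trials)  # ≈ 0.95
```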
Often, all that is sufficient for those theorems to hold is:
The model is at an optimum
The model is flexible enough
The sample size is big enough
… because then, for any point X you want to make predictions for, a big enough sample size means the empirical distribution contains a whole bunch of points similar to X. Those points shape the loss landscape, and because you’ve got a flexible model sitting at an optimum, that forces the model to fit them well enough.
But this “you’ve got a bunch of empirical points dragging the loss around in relevant ways” part only works on-distribution, because you don’t have a bunch of empirical points from off-distribution regions. Even if those regions technically form an exponentially small (rather than zero) slice of the true distribution, that means they have only an exponentially small effect on the loss function, and therefore being at an optimal loss is only exponentially weakly informative about them.
(Obviously this is somewhat complicated by overfitting, double descent, etc., but I think the gist of the argument goes through.)
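To illustrate the asymmetry with a toy sketch of my own (a flexible polynomial fit rather than anything realistic): the optimum is forced to be accurate where the empirical points live, while saying almost nothing about points far outside them.

```python
# Toy illustration: a flexible model at its optimum is accurate on-distribution,
# but essentially unconstrained off-distribution, because off-distribution points
# carried negligible weight in the loss being minimized.
import numpy as np

rng = np.random.default_rng(0)
true_f = np.sin

# The "distribution": x concentrated near 0; the model never sees large x.
x_train = rng.normal(size=2000)
y_train = true_f(x_train) + 0.1 * rng.normal(size=2000)

# A flexible model: degree-9 polynomial, least-squares fit
# (the KL-minimizing fit under a Gaussian noise model).
model = np.poly1d(np.polyfit(x_train, y_train, deg=9))

x_on = np.linspace(-2.0, 2.0, 101)    # where the empirical points live
x_off = np.array([5.0, 6.0, 8.0])     # exponentially unlikely under N(0, 1)

print("on-distribution max |error|:", np.max(np.abs(model(x_on) - true_f(x_on))))
print("off-distribution |errors|:  ", np.abs(model(x_off) - true_f(x_off)))
# The off-distribution errors are typically orders of magnitude larger.
```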
I guess it depends on whether one makes the cut between theory and practice with or without assuming that one has learned the distribution? I.e. I’m saying: if you have a distribution D, take some samples E from it, and approximate E with Q, then you might be able to prove that samples from Q are similar to samples from D, but you can’t prove that conditioning on something exponentially unlikely in D gives you something reasonable in Q. Meanwhile you’re saying that conditioning on something exponentially unlikely in D is tantamount to optimization.
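Here’s a crude toy version of that D / E / Q picture (my own construction; the rare event below is merely very unlikely under D rather than literally exponentially unlikely, but it shows the same failure mode): D is heavy-tailed, Q is the maximum-likelihood Gaussian fit to the samples E, the two roughly agree in the bulk, and yet their conditionals on the same tail event disagree wildly.

```python
# D: Student-t with 3 degrees of freedom (heavy-tailed).
# E: a large sample from D.
# Q: the maximum-likelihood Gaussian fit to E (mean and standard deviation of E).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
df = 3
E = stats.t.rvs(df, size=100_000, random_state=rng)
mu, sd = E.mean(), E.std()

# Bulk behaviour: the central quantiles of D and Q are in the same ballpark.
for q in (0.25, 0.5, 0.75):
    print(q, stats.t.ppf(q, df), stats.norm.ppf(q, loc=mu, scale=sd))

# Rare event: X > 10. Unlikely under D, astronomically unlikely under Q.
t0 = 10.0
print("P_D(X > 10) =", stats.t.sf(t0, df))
print("P_Q(X > 10) =", stats.norm.sf(t0, loc=mu, scale=sd))

# Conditioning on that rare event: how likely is X > 20 given X > 10?
print("P_D(X > 20 | X > 10) =", stats.t.sf(2 * t0, df) / stats.t.sf(t0, df))
print("P_Q(X > 20 | X > 10) =",
      stats.norm.sf(2 * t0, loc=mu, scale=sd) / stats.norm.sf(t0, loc=mu, scale=sd))
# Under D the conditional probability is on the order of ten percent; under Q it
# is vanishingly small. Q's conditional on the rare event says essentially
# nothing about D's.
```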