the peak of a flat minimum is a slightly better approximation for the posterior predictive distribution over the entire hypothesis class. Sometimes I even wonder if something like this explains why Occam’s Razor works...
That’s exactly correct. You can prove it via the Laplace approximation: the “width” of the peak along each principal direction (more precisely, the variance of the fitted Gaussian along that direction) is the inverse of the corresponding eigenvalue of the Hessian, and each eigenvalue λ_i contributes −½ log(λ_i) to the marginal log likelihood log P[data|model]. So, if a peak’s variance is twice as large in one direction, its marginal log likelihood is higher by ½ log(2), or half a bit. For models in which the number of free parameters is large relative to the number of data points (i.e. the interesting part of the double-descent curve), this is the main term of interest in the marginal log likelihood.
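Here’s a minimal numerical sketch of that bookkeeping (assumptions: a toy 2-parameter model whose unnormalized posterior P[data|θ]·P[θ] is exactly Gaussian around the peak, with the peak height normalized to 1 so only the width terms remain; the helper name `log_marginal` and the specific eigenvalues are just made up for illustration):

```python
import numpy as np
from scipy.integrate import dblquad

def log_marginal(hessian_eigs):
    """Laplace approximation to log P[data|model] for a peak whose height
    P[data|theta*] * P[theta*] is normalized to 1, so only the
    width/volume terms remain."""
    d = len(hessian_eigs)
    return 0.5 * d * np.log(2 * np.pi) - 0.5 * np.sum(np.log(hessian_eigs))

narrow = np.array([2.0, 1.0])   # Hessian eigenvalues at the peak
wide   = np.array([1.0, 1.0])   # variance along the first direction doubled

gain_nats = log_marginal(wide) - log_marginal(narrow)
print(gain_nats, 0.5 * np.log(2))   # both ~0.3466 nats, i.e. half a bit

# Cross-check the Laplace formula against brute-force integration of the
# (here exactly Gaussian) unnormalized posterior exp(-1/2 * theta^T H theta):
H = np.diag(narrow)
integral, _ = dblquad(
    lambda y, x: np.exp(-0.5 * (H[0, 0] * x**2 + H[1, 1] * y**2)),
    -10, 10, -10, 10)
print(np.log(integral), log_marginal(narrow))   # should agree closely
```

The brute-force integral matches the Laplace formula here only because the toy posterior is exactly Gaussian; for a real network the formula is just a second-order approximation around the mode.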
In Jaynes’ Logic of Science book, in the chapter on model comparison, I believe he directly claims that this is how/why Occam’s Razor works.
That said, it’s not clear that this particular proof would generalize properly to systems which perfectly fit the data. In that case, there’s an entire surface on which P[data|θ] is constant, so Laplace needs some tweaking.
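To make that failure mode concrete, here’s a tiny sketch (assumption: a toy peak that is perfectly flat along one principal direction, so one Hessian eigenvalue is exactly zero):

```python
import numpy as np

# One exactly-flat direction: P[data|theta] is constant along it, so the
# corresponding Hessian eigenvalue is zero and det(H) = 0.
eigs = np.array([2.0, 0.0])

with np.errstate(divide="ignore"):
    laplace = 0.5 * len(eigs) * np.log(2 * np.pi) - 0.5 * np.sum(np.log(eigs))

print(laplace)  # inf: the plain Laplace formula blows up; the flat
                # directions have to be integrated over explicitly
                # (against the prior) instead.
```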