For small datasets, PAC-Bayes bounds suffer because they scale as sqrt(KL/N).
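To make that scaling concrete, here is a rough sketch of the complexity term in a McAllester-style bound (in the form usually attributed to Maurer), assuming a loss bounded in [0, 1]. The function name `pac_bayes_slack` and the example KL of 50 nats are illustrative choices of mine, not numbers taken from any experiment.

```python
import math


def pac_bayes_slack(kl, n, delta=0.05):
    """Complexity term sqrt((KL + ln(2*sqrt(N)/delta)) / (2N)).

    Assumes a loss bounded in [0, 1]; this is the amount added to the
    empirical risk of the posterior, holding with probability >= 1 - delta.
    """
    return math.sqrt((kl + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n))


# Hypothetical KL(posterior || prior) of 50 nats, at several dataset sizes.
for n in (100, 10_000, 1_000_000):
    print(f"N={n:>9}: slack = {pac_bayes_slack(kl=50.0, n=n):.3f}")
```

With a fixed KL, the slack only shrinks like 1/sqrt(N), so at N = 100 it is already larger than 0.5 on a loss that cannot exceed 1.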
I agree with you about the current PAC-Bayes bounds, but there are other results which I think are more powerful and useful.
Not sure if I agree regarding the real-world usefulness. For the non-IID case, PAC-Bayes bounds fail, and to reinstate them you’d need assumptions about how quickly the distribution changes. But then it’s plausible that you could get high-probability bounds based on the most recent performance.
I think you can make even looser assumptions than that, as how quickly and in what way the distribution changes are themselves quantities that can be estimated. I wouldn’t be surprised if you could get some very general results by using equally expressive time series models.
That piece you link uses a definition of overfitting which doesn’t really make sense from a Bayesian perspective. “The difference between the performance on the training set and the performance on the test set” is not what we care about; we care about the difference between the expected performance on the test set and the actual performance on the test set.
Indeed, it’s entirely possible that the training data and the test data are of qualitatively different types, drawn from entirely different distributions. A Bayesian method with a well-informed model can often work well in such circumstances. In that case, the performance on the training set and the performance on the test set aren’t even comparable in principle.
For instance, we could have some experiment trying to measure the gravitational constant, and use a Bayesian model to estimate the constant from whatever data we’ve collected. Our “test data” is then the “true” value of G, as measured by better experiments than ours. Here, we can compare our expected performance to actual performance, but there’s no notion of performance comparison between train and test.
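As a toy sketch of that comparison (my own construction, not a real analysis): assume each measurement is the reference value plus Gaussian noise of known size, and put a normal prior on G. The measurements below are simulated, and the noise level, prior, and function name `posterior_for_g` are all made up for illustration. Under the model, the expected squared error of the posterior mean is just the posterior variance, which we can compare to the actual squared error against the reference value.

```python
import math
import random

# Approximate accepted value of G in m^3 kg^-1 s^-2, playing the role of "test data".
G_REFERENCE = 6.674e-11


def posterior_for_g(measurements, noise_sd, prior_mean, prior_sd):
    """Conjugate normal update for a constant observed with known Gaussian noise."""
    prior_prec = 1.0 / prior_sd ** 2
    data_prec = len(measurements) / noise_sd ** 2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_mean * prior_prec + sum(measurements) / noise_sd ** 2)
    return post_mean, post_var


# Simulated (not real) measurements from a hypothetical, fairly noisy experiment.
rng = random.Random(0)
noise_sd = 0.05e-11
measurements = [rng.gauss(G_REFERENCE, noise_sd) for _ in range(25)]

post_mean, post_var = posterior_for_g(
    measurements, noise_sd, prior_mean=6.5e-11, prior_sd=1.0e-11
)

expected_sq_error = post_var                       # what the model predicts
actual_sq_error = (post_mean - G_REFERENCE) ** 2   # what we actually got
print(f"expected RMS error: {math.sqrt(expected_sq_error):.2e}")
print(f"actual error:       {math.sqrt(actual_sq_error):.2e}")
```

Nothing in this comparison involves a "training loss": the only check is whether the actual error is plausible given the error the model expected.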
I think this is beyond the scope of what the post is trying to address. One of the stated assumptions is:
The data is independent and identically distributed and comes separated in a training set and a test set.
In that case, a naive estimate of the expected test loss would be the average training loss using samples of the posterior. The author shows that this is an underestimate and gives us a much better alternative in the form of the WAIC.
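Here is a minimal sketch of the two estimates on a toy model, under the IID assumption above. I am assuming unit-variance Gaussian data with an unknown mean and a conjugate normal prior, and I use the usual per-sample form of the WAIC (Bayes training loss plus functional variance divided by n). The data are simulated, and the function names are mine rather than the post's.

```python
import math
import random


def log_normal_pdf(y, mean, var=1.0):
    return -0.5 * math.log(2.0 * math.pi * var) - (y - mean) ** 2 / (2.0 * var)


def log_mean_exp(values):
    """Numerically stable log of the average of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def naive_and_waic(y, n_samples=2000, prior_var=100.0, seed=0):
    """Return (average training loss over posterior samples, WAIC), per data point."""
    rng = random.Random(seed)
    n = len(y)
    # Conjugate posterior for theta in y ~ N(theta, 1), theta ~ N(0, prior_var).
    post_var = 1.0 / (1.0 / prior_var + n)
    post_mean = post_var * sum(y)
    thetas = [rng.gauss(post_mean, math.sqrt(post_var)) for _ in range(n_samples)]

    naive = 0.0        # -mean_i E_theta[log p(y_i | theta)]
    bayes_train = 0.0  # -mean_i log E_theta[p(y_i | theta)]
    func_var = 0.0     # sum_i Var_theta[log p(y_i | theta)]
    for yi in y:
        ll = [log_normal_pdf(yi, t) for t in thetas]
        mean_ll = sum(ll) / n_samples
        naive -= mean_ll / n
        bayes_train -= log_mean_exp(ll) / n
        func_var += sum((v - mean_ll) ** 2 for v in ll) / (n_samples - 1)
    return naive, bayes_train + func_var / n


# Simulated training set of 20 points.
data_rng = random.Random(1)
y_train = [data_rng.gauss(1.0, 1.0) for _ in range(20)]
naive, waic = naive_and_waic(y_train)
print(f"naive training-loss estimate: {naive:.4f}")
print(f"WAIC:                         {waic:.4f}")
```

In this setup the naive estimate comes out below the WAIC, which matches the claim that it underestimates the expected test loss.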
a naive estimate of the expected test loss would be the average training loss using samples of the posterior.
That’s exactly the problem—that is generally not a good estimate of the expected test loss. It isn’t even an unbiased estimate. It’s just completely wrong.
The right way to do this is to just calculate the expected test loss.
That’s an interesting link. It sounds like the results can only be applied to strictly Bayesian methods though, so they couldn’t be applied to neural networks as they exist now.
There is some progress in that direction though. The bigger problem, as mentioned in the link, is that the estimator seems to break down completely if you try to use an approximation to the posterior, although there seems to be ongoing work on estimating generalisation error just from MCMC samples.