Deriving bounds on the generalization error might seem pointless when it’s easy to do this by just holding out a validation set. I think the main value is in providing a test of purported theories: your ‘explanation’ for why neural networks generalize ought to be able to produce non-trivial bounds on their generalization error.
I think there’s more value to the exercise than just that. It may be less useful in the IID case with lots of data, where holding out a “validation set” makes sense, but there are many non-IID time-series problems where your “dataset” effectively consists of a single datapoint, and slicing a “validation set” out of it is sketchy at best, and at worst a great recipe for watching an overconfident model fail catastrophically. There are also situations where data is so scarce or expensive to collect that carving out a validation or test set would leave you without enough data to fit a useful model. Being able to form generalisation bounds in non-IID situations, without relying on a validation set, would be extremely useful for understanding the behaviour of AI or AGI systems deployed in the “real world”.
Even if you can’t derive general or tight bounds, understanding how those bounds change as we add assumptions (the Markov property, IID, etc.) can tell us more about when it’s safe to deploy AI systems.
Not sure if I agree regarding the real-world usefulness. For the non-IID case, PAC-Bayes bounds fail, and to re-instate them you’d need assumptions about how quickly the distribution changes, but then it’s plausible that you could get high probability bounds based on the most recent performance. For small datasets, the PAC-Bayes bounds suffer because they scale as √(KL/N). (I may edit the post to be clearer about this)
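For reference, the scaling claim comes from bounds of roughly the following McAllester-style form (quoted from memory, so treat the exact constants as indicative rather than as anything stated in the post):

```latex
% A McAllester-style PAC-Bayes bound (constants vary between versions):
% for a fixed prior $P$, with probability at least $1-\delta$ over the
% $N$ i.i.d. training samples, simultaneously for all posteriors $Q$,
\mathbb{E}_{h \sim Q}\!\left[ L(h) \right]
  \;\le\;
\mathbb{E}_{h \sim Q}\!\left[ \hat{L}(h) \right]
  + \sqrt{ \frac{ \mathrm{KL}(Q \,\|\, P) + \ln\!\frac{2\sqrt{N}}{\delta} }{ 2N } }
```

The complexity term only shrinks like √(KL/N), which is why a small N (or a large KL term) quickly makes the bound vacuous.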
Agreed that analyzing how the bounds change under different conditions could be insightful though. Ultimately I suspect that effective bounds will require powerful ways to extract ‘the signal from the noise’, and examining the signal will likely be useful for understanding if a model has truly learned what it is supposed to.
For small datasets, the PAC-Bayes bounds suffer because they scale as √(KL/N)
I agree with you about the current PAC-Bayes bounds, but there are other results which I think are more powerful and useful.
Not sure if I agree regarding the real-world usefulness. For the non-IID case, PAC-Bayes bounds fail, and to re-instate them you’d need assumptions about how quickly the distribution changes, but then it’s plausible that you could get high probability bounds based on the most recent performance.
I think you can make even looser assumptions than that, since how quickly, and in what way, the distribution changes are themselves quantities that can be estimated. I wouldn’t be surprised if you could get some very general results by using equally expressive time series models.
That piece you link uses a definition of overfitting which doesn’t really make sense from a Bayesian perspective. “The difference between the performance on the training set and the performance on the test set” is not what we care about; we care about the difference between the expected performance on the test set and the actual performance on the test set.
Indeed, it’s entirely possible that the training data and the test data are of qualitatively different types, drawn from entirely different distributions. A Bayesian method with a well-informed model can often work well in such circumstances. In that case, the performance on the training and test sets aren’t even comparable-in-principle.
For instance, we could have some experiment trying to measure the gravitational constant, and use a Bayesian model to estimate the constant from whatever data we’ve collected. Our “test data” is then the “true” value of G, as measured by better experiments than ours. Here, we can compare our expected performance to actual performance, but there’s no notion of performance comparison between train and test.
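A minimal sketch of that example, with made-up numbers (the measurements, noise level, and prior below are all invented for illustration): a conjugate normal model lets us compare the posterior’s expected error with its actual error against a higher-precision reference value, and no train/test comparison appears anywhere.

```python
import numpy as np

# Hypothetical noisy measurements of G from our (bad) experiment,
# in units of 1e-11 m^3 kg^-1 s^-2.  These numbers are invented.
measurements = np.array([6.71, 6.69, 6.62, 6.70, 6.66])
sigma = 0.05                       # assumed known measurement noise (std dev)

# Conjugate normal prior on G, wide and roughly centred on the textbook value.
prior_mean, prior_var = 6.674, 0.5 ** 2

# Normal-normal conjugate update: posterior over the true value of G.
n = len(measurements)
post_var = 1.0 / (1.0 / prior_var + n / sigma ** 2)
post_mean = post_var * (prior_mean / prior_var + measurements.sum() / sigma ** 2)

# "Test data": a far more precise reference value from better experiments.
g_reference = 6.6743

expected_sq_error = post_var                       # E[(G - post_mean)^2] under our posterior
actual_sq_error = (post_mean - g_reference) ** 2   # error against the reference value

print(f"posterior mean     : {post_mean:.4f}")
print(f"expected sq. error : {expected_sq_error:.6f}")
print(f"actual sq. error   : {actual_sq_error:.6f}")
```

If the model is well calibrated, the actual error should typically be of the same order as the expected error; a large mismatch is the Bayesian analogue of overfitting, and it is diagnosed without ever comparing training performance to test performance.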
I think this is beyond the scope of what the post is trying to address. One of the stated assumptions is:
The data is independent and identically distributed and comes separated in a training set and a test set.
In that case, a naive estimate of the expected test loss would be the training loss averaged over samples from the posterior. The author shows that this is an underestimate and gives us a much better alternative in the form of the WAIC.
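For concreteness, here is a minimal sketch of both quantities computed from posterior samples, using the standard lppd / p_waic construction; the `log_lik` matrix of pointwise log-likelihoods is an assumed input rather than anything provided in the post:

```python
import numpy as np
from scipy.special import logsumexp

def naive_train_loss(log_lik):
    """Training negative log-likelihood per data point, averaged over
    posterior samples.  log_lik has shape (S, N): log p(y_i | theta_s)
    for S posterior draws and N training points."""
    return -log_lik.mean()

def waic(log_lik):
    """WAIC estimate of the expected out-of-sample log loss, per data
    point in nats (the conventional WAIC is 2 * N times this)."""
    S, N = log_lik.shape
    # log pointwise predictive density: log of the posterior-mean likelihood
    lppd = logsumexp(log_lik, axis=0) - np.log(S)   # shape (N,)
    # effective number of parameters: per-point variance of the log-likelihood
    p_waic = log_lik.var(axis=0, ddof=1)            # shape (N,)
    return -(lppd - p_waic).mean()

# Usage sketch: log_lik would come from your sampler, e.g.
#   log_lik[s, i] = log p(y_i | theta_s) for MCMC draws theta_s.
```

On the same samples, `naive_train_loss` will typically come out lower than `waic`, which is exactly the optimism being pointed at.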
a naive estimate of the expected test loss would be the training loss averaged over samples from the posterior.
That’s exactly the problem—that is generally not a good estimate of the expected test loss. It isn’t even an unbiased estimate. It’s just completely wrong.
The right way to do this is to just calculate the expected test loss.
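A quick numerical check of that bias in a toy conjugate model (everything below is synthetic and chosen only to illustrate the direction of the effect): across many simulated training sets, the training loss averaged over posterior samples systematically underestimates the true expected test loss.

```python
import numpy as np

rng = np.random.default_rng(0)

mu_true, sigma = 0.0, 1.0        # data-generating process (sigma treated as known)
n, reps = 20, 5000               # small training sets, many repetitions

log_norm_const = 0.5 * np.log(2 * np.pi * sigma ** 2)

def posterior_avg_nll(x, post_mean, post_var):
    """Negative log-likelihood of points x, averaged over the posterior
    mu ~ N(post_mean, post_var); closed form for this normal model."""
    return log_norm_const + ((x - post_mean) ** 2 + post_var) / (2 * sigma ** 2)

naive, expected_test = [], []
for _ in range(reps):
    x_train = rng.normal(mu_true, sigma, n)
    post_mean = x_train.mean()          # flat prior => posterior is N(x_bar, sigma^2 / n)
    post_var = sigma ** 2 / n

    # "Naive" estimate: training loss averaged over the posterior.
    naive.append(posterior_avg_nll(x_train, post_mean, post_var).mean())
    # Actual expected test loss: expectation over fresh x ~ N(mu_true, sigma^2).
    expected_test.append(log_norm_const
                         + (sigma ** 2 + (post_mean - mu_true) ** 2 + post_var)
                         / (2 * sigma ** 2))

print(f"mean naive train-based estimate : {np.mean(naive):.4f}")
print(f"mean expected test loss         : {np.mean(expected_test):.4f}")  # systematically higher
```

With these settings the gap is roughly 1/n nats per data point, which is exactly the bias the naive estimate ignores.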
That’s an interesting link. It sounds like the results can only be applied to strictly Bayesian methods though, so they couldn’t be applied to neural networks as they exist now.
There is some progress in that direction, though. The bigger problem, as mentioned in the link, is that the estimator seems to break down completely if you try to use an approximation to the posterior, although there seems to be ongoing work on estimating generalisation error just from MCMC samples.