Indeed, it’s entirely possible that the training data and the test data are of qualitatively different types, drawn from entirely different distributions. A Bayesian method with a well-informed model can often work well in such circumstances. In that case, performance on the training set and performance on the test set aren’t comparable, even in principle.
For instance, we could have some experiment trying to measure the gravitational constant, and use a Bayesian model to estimate the constant from whatever data we’ve collected. Our “test data” is then the “true” value of G, as measured by better experiments than ours. Here, we can compare our expected performance to actual performance, but there’s no notion of performance comparison between train and test.
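To make the G example concrete, here is a minimal sketch under assumptions that are mine rather than anything stated above: a Gaussian prior on G, independent Gaussian measurement noise with known standard deviation, and squared error as the performance measure. The function names are hypothetical, the measurements in the usage lines are simulated rather than real, and values are in units of 1e-11 m^3 kg^-1 s^-2.

```python
import numpy as np

# CODATA 2018 recommended value of G, in units of 1e-11 m^3 kg^-1 s^-2.
G_REFERENCE = 6.67430


def posterior_for_constant(measurements, noise_sd, prior_mean, prior_sd):
    """Conjugate Normal-Normal update for a single unknown constant.

    Assumes each measurement equals the constant plus independent Gaussian
    noise with known standard deviation noise_sd (an illustrative assumption).
    """
    y = np.asarray(measurements, dtype=float)
    prior_precision = 1.0 / prior_sd**2
    data_precision = len(y) / noise_sd**2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + y.sum() / noise_sd**2)
    return post_mean, post_var


def expected_vs_actual_error(measurements, noise_sd, prior_mean, prior_sd,
                             reference=G_REFERENCE):
    """Return (expected squared error under the posterior, actual squared error).

    The expected squared error of the posterior mean, according to the model
    itself, is the posterior variance. The actual squared error is measured
    against an external reference value, which plays the role of "test data".
    """
    post_mean, post_var = posterior_for_constant(
        measurements, noise_sd, prior_mean, prior_sd
    )
    return post_var, (post_mean - reference) ** 2


if __name__ == "__main__":
    # Simulated (not real) measurements, purely for illustration.
    rng = np.random.default_rng(0)
    fake_measurements = rng.normal(G_REFERENCE, 0.05, size=10)
    expected, actual = expected_vs_actual_error(
        fake_measurements, noise_sd=0.05, prior_mean=6.7, prior_sd=1.0
    )
    print("expected squared error:", expected)
    print("actual squared error:  ", actual)
```

Nothing in this sketch is a "training loss": the only comparison available is between the error the model expects and the error it turns out to have against the reference value.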
I think this is beyond the scope of what the post is trying to address. One of the stated assumptions is:
The data is independent and identically distributed and comes separated in a training set and a test set.
In that case, a naive estimate of the expected test loss would be the average training loss using samples of the posterior. The author shows that this is an underestimate and gives us a much better alternative in the form of the WAIC.
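For the i.i.d. setting, a small simulation can show the gap. This is a sketch under my own assumptions, not the post's code: a toy model with unit-variance Gaussian data, an unknown mean, and a conjugate N(0, 100) prior; "training loss" read as the log loss of the posterior-averaged predictive on the training points; and WAIC taken as that training loss plus the mean over data points of the posterior variance of the pointwise log-likelihood. All function names are hypothetical.

```python
import numpy as np


def log_normal_pdf(y, mu, sigma=1.0):
    """Log density of N(mu, sigma^2) evaluated at y (broadcasts over arrays)."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((y - mu) / sigma) ** 2


def posterior_samples(y, rng, prior_var=100.0, n_samples=300):
    """Draw from the conjugate posterior over the mean (unit noise variance)."""
    post_var = 1.0 / (1.0 / prior_var + len(y))
    post_mean = post_var * np.sum(y)
    return rng.normal(post_mean, np.sqrt(post_var), size=n_samples)


def bayes_loss(y, mu_samples):
    """Average negative log of the posterior-averaged predictive density.

    Also returns the matrix ll[s, i] = log p(y_i | mu_s).
    """
    ll = log_normal_pdf(y[None, :], mu_samples[:, None])
    m = ll.max(axis=0)  # stabilise the log-mean-exp over posterior samples
    log_pred = m + np.log(np.mean(np.exp(ll - m), axis=0))
    return -np.mean(log_pred), ll


def waic(y, mu_samples):
    """Training loss plus mean posterior variance of pointwise log-likelihood."""
    train_loss, ll = bayes_loss(y, mu_samples)
    return train_loss + np.mean(np.var(ll, axis=0))


def simulate(n_train=20, n_test=500, n_reps=1000, seed=0):
    """Average (training loss, WAIC, test loss) over many simulated datasets."""
    rng = np.random.default_rng(seed)
    results = []
    for _ in range(n_reps):
        y_train = rng.normal(0.5, 1.0, size=n_train)
        y_test = rng.normal(0.5, 1.0, size=n_test)  # same distribution (i.i.d.)
        mu = posterior_samples(y_train, rng)
        train_loss, _ = bayes_loss(y_train, mu)
        test_loss, _ = bayes_loss(y_test, mu)
        results.append((train_loss, waic(y_train, mu), test_loss))
    return np.mean(results, axis=0)


if __name__ == "__main__":
    train_loss, waic_value, test_loss = simulate()
    print("mean training loss:", train_loss)
    print("mean WAIC:         ", waic_value)
    print("mean test loss:    ", test_loss)
```

In this toy setup the averaged training loss should come out below the averaged test loss, with WAIC landing closer to the test loss. How well that carries over to other models depends on the conditions under which WAIC is justified.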
a naive estimate of the expected test loss would be the average training loss using samples of the posterior.
That’s exactly the problem—that is generally not a good estimate of the expected test loss. It isn’t even an unbiased estimate. It’s just completely wrong.
The right way to do this is to just calculate the expected test loss.