See my comment at https://www.lesswrong.com/posts/Mw9if9wbfTBawynGs/?commentId=uEtuotdC6oD2H2FPw on a closely related question.
Briefly, if instead of using performance on a test set to judge future performance on real data, or to compare two models, you instead use a formal Bayesian approach that looks only at the training data, the quality of the answers from this formal Bayesian approach may depend very crucially on getting the Bayesian model specification (including priors for parameters) almost exactly right (in the sense of expressing your true prior knowledge of the problem). And getting it that close to exactly right may be beyond your ability.
And in any case, we all know that there is a non-negligible chance that your program to do the Bayesian computations simply has a bug. So seeing how well you do on a held-out test set before launching your system to Jupiter is a good idea.
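For concreteness, here is a minimal sketch of the kind of held-out evaluation being described, using scikit-learn on synthetic data (the dataset, the two models, and the split are my own illustrative choices, not anything from the thread): both models are fit on the training portion only, and then compared by their accuracy on a test portion that was never touched during fitting.

```python
# Minimal sketch: compare two models by accuracy on a held-out test set.
# The synthetic data stands in for "real data"; the test split plays the
# role of future cases drawn from the same distribution.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)                           # fit on training data only
    acc = accuracy_score(y_test, model.predict(X_test))   # judge on held-out data
    print(f"{name}: held-out accuracy = {acc:.3f}")
```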
Thanks for your reply Professor Neal. Why does it make sense to use the test-set performance to judge performance on an arbitrary/future dataset? Does the test set have some other interpretation that I am missing? If we wanted to judge future performance on real data, or compare two models by future performance on real data, shouldn't we just calculate the most likely performance on an arbitrary dataset?
I’m not sure what you’re asking here. The test set should of course be drawn from the same distribution as the future cases you actually care about. In practice, it can sometimes be hard to ensure that. But judging by performance on an arbitrary data set isn’t an option, since performance in the future does depend on what data shows up in the future (for a classification problem, on both the inputs, and of course on the class labels). I think I’m missing what you’re getting at....
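To illustrate the point about distributions, here is a small simulation (entirely my own construction, with a deliberately mismatched "future" input distribution, not something from the thread): the same fitted classifier is scored on a test set drawn from the training distribution and on one whose inputs crowd the decision boundary, and the two accuracy numbers come out quite different. A test set drawn from a distribution other than the one you actually care about can therefore give a misleading estimate of future performance.

```python
# Sketch: the same model's measured accuracy depends heavily on which
# distribution the evaluation data is drawn from.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_data(n, scale=1.0):
    # Linear ground truth with label noise; `scale` controls how far the
    # inputs tend to sit from the true decision boundary.
    X = rng.normal(scale=scale, size=(n, 5))
    y = (X.sum(axis=1) + rng.normal(size=n) > 0).astype(int)
    return X, y

X_train, y_train = make_data(5000)
X_match, y_match = make_data(2000)              # same distribution as training
X_shift, y_shift = make_data(2000, scale=0.3)   # "future" inputs crowd the boundary

model = LogisticRegression().fit(X_train, y_train)
print("accuracy, matched test set:", accuracy_score(y_match, model.predict(X_match)))
print("accuracy, shifted test set:", accuracy_score(y_shift, model.predict(X_shift)))
```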