Frequentists are apparently afraid of the possibility that “subjectivity”—that thing they were accusing Bayesians of—could allow some unspecified terrifying abuse of the scientific process. Do I need to point out the general implications of being allowed to throw away your actual experimental results and substitute a class you made up?
The demon you’re describing can’t be exorcised simply by switching from frequentism to Bayesianism—it torments the Bayesians as well. It is more an issue of intellectual honesty than of statistical paradigm.
A Bayesian falls into the trap by choosing a prior after the data is observed. Say you have a data set partitioned into a training set and a test set. You think the data is well described by a model class M. So you take the training set and run a learning algorithm, get some good parameters, and then use the learned model to make predictions on the test set. You fail. Hmmm, you say, back to the drawing board, and cook up a different model class M'. You repeat the process with M' and voilà, accurate predictions on the test set. Time to declare victory, right? Wrong. You haven't proved anything, because you looked at the test data.
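Here is a minimal sketch of that trap, using data and model classes I made up for illustration (random noise with no real signal, and "model classes" that are just random feature subsets fit by least squares). Any apparent success on the reused test set is an artifact of having looked at it:

```python
# Illustrative sketch only: keep trying model classes, scoring each on the SAME
# test set, and stop when one "works". The data is pure noise, so any apparent
# success comes entirely from having selected against the test data.
import numpy as np

rng = np.random.default_rng(0)
n_features = 20

X_train, y_train = rng.normal(size=(200, n_features)), rng.integers(0, 2, 200)
X_test,  y_test  = rng.normal(size=(100, n_features)), rng.integers(0, 2, 100)
X_fresh, y_fresh = rng.normal(size=(100, n_features)), rng.integers(0, 2, 100)

def fit_and_predict(features, X_tr, y_tr, X_ev):
    # A "model class M": a linear classifier restricted to a chosen feature subset.
    A = X_tr[:, features]
    w, *_ = np.linalg.lstsq(A, 2 * y_tr - 1, rcond=None)
    return (X_ev[:, features] @ w > 0).astype(int)

best = None
for trial in range(500):
    # Each iteration is a fresh "model class M'": a different random feature subset.
    features = rng.choice(n_features, size=5, replace=False)
    test_acc = (fit_and_predict(features, X_train, y_train, X_test) == y_test).mean()
    if best is None or test_acc > best[1]:
        best = (features, test_acc)
    if test_acc > 0.65:   # "accurate predictions on the test set" -- declare victory
        break

features, test_acc = best
fresh_acc = (fit_and_predict(features, X_train, y_train, X_fresh) == y_fresh).mean()
print(f"accuracy on the reused test set: {test_acc:.2f}")   # typically well above 0.5
print(f"accuracy on untouched data:      {fresh_acc:.2f}")  # typically ~0.5
```

The chosen model class looks good on the test set it was selected against and falls back to chance on data nobody peeked at.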
In my view, no one has really understood the true magnitude of this problem—no one has looked the demon in the eye without blinking. For example, on this page you can find a listing of the best results achieved on the MNIST handwritten digit benchmark. You can see how the results get better and better. But this improvement doesn't mean much, because which methods get selected and reported is decided by how well they do on the test data! In other words, the machine learning community implements overfitting.
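A toy simulation of that community-level selection effect, with numbers I made up rather than actual MNIST results (every submission has the same 2% true error, evaluated on a 10,000-example test set):

```python
# Toy illustration: many groups submit classifiers of identical true quality to a
# shared benchmark, and the community reports whichever scores best on the fixed
# test set. The "leaderboard improvement" is just luck plus selection.
import numpy as np

rng = np.random.default_rng(1)

true_error    = 0.02      # assumed: every submission really misclassifies 2% of inputs
test_size     = 10_000    # assumed: MNIST-sized test set
n_submissions = 200       # assumed: number of entries on the leaderboard over the years

# Each submission's measured error count is binomially distributed around the true error.
measured_errors = rng.binomial(test_size, true_error, size=n_submissions) / test_size

print(f"true error of every method:      {true_error:.4f}")
print(f"median measured error:           {np.median(measured_errors):.4f}")
print(f"leaderboard-best measured error: {measured_errors.min():.4f}")
# The minimum sits systematically below the true error even though no method is
# actually better than any other -- that gap is overfitting to the shared test set
# via selection.
```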
I thought the Netflix Prize did a pretty good job of handling this: the public leaderboard was scored on one held-out "quiz" set, while the final standings were decided on a separate hidden "test" set whose scores entrants never saw.