So, I use cross-validation to choose a model. Then I use the same data to train the model. Insanity ensues.
Why? A model has two components: the hyperparameters and the parameters. The hyperparameters are inputs to the model, and the parameters are calculated from the hyperparameters and the training data. (This is very similar in spirit to what are called ‘hierarchical Bayesian models.’)
Instead of pulling a prior out of thin air for the hyperparameters, this asks the question “which hyperparameters are best for generalizing models to test sets outside the training set?”, which is a different question from “which parameters maximize the likelihood of this data?”
(I should add that some people use the term ‘cross-tuning’ for reporting a model whose hyperparameters were selected by this sort of process when there is no third dataset, held out from the tuning, used for testing. Standard practice in ML is to still call it ‘cross-validation.’)
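To make the distinction concrete, here is a minimal sketch (my own example, using ridge regression, scikit-learn, and synthetic data, none of which is specific to the argument above): the penalty strength alpha is a hyperparameter chosen by cross-validation, and the coefficients are the parameters computed from that alpha plus the training data.

```python
# A sketch, assuming ridge regression and scikit-learn; the data are synthetic.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=200)

# "Which hyperparameters generalize best?" -- scored on held-out CV folds.
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
cv_scores = [cross_val_score(Ridge(alpha=a), X, y, cv=5).mean() for a in alphas]
best_alpha = alphas[int(np.argmax(cv_scores))]

# "Which parameters maximize the fit to this data?" -- the coefficients,
# computed from the chosen hyperparameter and the training data.
model = Ridge(alpha=best_alpha).fit(X, y)
print(best_alpha, model.coef_)
```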
Besides, even cross-validation for model selection is suspicious. Shouldn’t I, ideally, train every model on all of the data and form a posterior over the most probable values?
If you do this, how will you get an estimate of how well your model is able to predict outside of the training set?
But once they do have the hyperparameters in place, this is what people do: they fit the model on the full training data, so that they can make the most use of everything.
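In scikit-learn terms, that workflow might look roughly like the sketch below (again with a made-up dataset and a ridge model as stand-ins): cross-validation on the training set picks the hyperparameter, the winning model is refit on all of the training data, and a held-out test set that played no part in the tuning supplies the estimate of out-of-sample performance.

```python
# Sketch of the workflow: tune by cross-validation on the training set,
# refit on the full training set, then score once on a held-out test set.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Cross-validation over the training set chooses alpha; refit=True (the
# default) then refits the best model on all of the training data.
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}, cv=5)
search.fit(X_train, y_train)

# The test set was never touched during tuning, so this score is an honest
# estimate of how the chosen model predicts outside the training set.
print(search.best_params_, search.score(X_test, y_test))
```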