For someone like me who grew into probability with Jaynes’ book, seeing in the first chapter that algorithms are trained multiple times on the same data (cross-validation) was… annoying, let’s say (I actually screamed at the book).
There are two ways to train algorithms ‘multiple times’ on the same data. The bad one is data duplication; cross-validation is the good one. Data duplication is the sort of thing Jaynes would have been worried about, because it means you’re counting evidence from the same piece of data twice, and so your model ends up with illusory precision.
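As a quick numerical sketch of that illusory precision (the numbers and the normal model here are my own toy example, not anything from the book): duplicate every datapoint and the estimated standard error of the mean shrinks by roughly a factor of √2, even though no new evidence has arrived.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=50)   # one genuine sample
doubled = np.concatenate([data, data])           # the same evidence, counted twice

def standard_error(x):
    # Standard error of the mean: sample standard deviation / sqrt(n).
    return x.std(ddof=1) / np.sqrt(len(x))

print(standard_error(data))     # honest uncertainty from 50 points
print(standard_error(doubled))  # about 1/sqrt(2) of the above, yet nothing new was learned
```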
But what does cross-validation do? There’s an issue called “overfitting”: any statistical procedure performed on a training set will fit both the signal and the noise in that set. On a test set the signal will presumably be the same, but the noise will be different, so the model will do worse. Single validation is when you split your data into two parts, a training set and a test set, so that you can see how well the model trained on the training set does on the test set. When the training method has a tunable parameter, people will sometimes optimize that parameter against the test set.*
But to do one split and leave it at that is wasteful. Cross-validation is when you partition the data many times, fit many different models, and can thus talk about how the population of models behaves. In particular, consider ‘leave-one-out’ cross-validation: in a dataset of n points, we train n different models, each time using n−1 datapoints to fit the model parameters and testing on the one datapoint left out. This gives each individual model as much training data as possible while still leaving us held-out data with which to gauge how resilient to overfitting our model-generation procedure is.
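Here is a minimal sketch of leave-one-out cross-validation in plain numpy; the straight-line model and the synthetic data are just illustrative stand-ins, not anything the book prescribes.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 20)
y = 3.0 * x + 1.0 + rng.normal(scale=2.0, size=x.size)   # noisy straight line

squared_errors = []
for i in range(len(x)):
    keep = np.arange(len(x)) != i                    # leave point i out
    coeffs = np.polyfit(x[keep], y[keep], deg=1)     # fit parameters on the n-1 remaining points
    prediction = np.polyval(coeffs, x[i])            # predict the single held-out point
    squared_errors.append((prediction - y[i]) ** 2)

print("leave-one-out mean squared error:", np.mean(squared_errors))
```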
* The principled way to do this is to split the data into three parts: a training set (which the algorithm always has access to), a validation set (which the algorithm has access to only when setting the tunable parameters), and a test set (which the algorithm never has access to, but which is used to assess how well the model does after the tunable parameters have been optimized).
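A sketch of that three-way split (the 60/20/20 proportions and the shuffling are arbitrary choices of mine, not something the footnote specifies):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
indices = rng.permutation(n)            # shuffle before splitting

n_train = int(0.6 * n)
n_val = int(0.2 * n)

train_idx = indices[:n_train]                    # always available to the algorithm
val_idx = indices[n_train:n_train + n_val]       # only consulted when setting tunable parameters
test_idx = indices[n_train + n_val:]             # touched once, for the final assessment

print(len(train_idx), len(val_idx), len(test_idx))   # 600 200 200
```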
Allow me to quote directly from the book:

“The training sample of size m is then used to compute the n-fold cross-validation error R_CV(θ) for a small number of possible values of θ. θ is next set to the value θ_0 for which R_CV(θ) is smallest and the algorithm is trained with the parameter setting θ_0 over the full training sample of size m.”
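Read as pseudocode, the quoted procedure looks roughly like the sketch below. Here θ is a ridge penalty, and the grid of θ values, the five folds, and the model are my own illustrative choices rather than the book’s:

```python
import numpy as np

def fit_ridge(X, y, theta):
    # The parameters (weights) are computed from the hyperparameter theta and the data.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + theta * np.eye(d), X.T @ y)

def cv_error(X, y, theta, n_folds=5):
    # n-fold cross-validation error R_CV(theta): average held-out squared error.
    all_idx = np.arange(len(y))
    errors = []
    for fold in np.array_split(all_idx, n_folds):
        train = np.setdiff1d(all_idx, fold)
        w = fit_ridge(X[train], y[train], theta)
        errors.append(np.mean((X[fold] @ w - y[fold]) ** 2))
    return np.mean(errors)

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))                              # training sample of size m = 100
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=100)

grid = [0.01, 0.1, 1.0, 10.0]                              # a small number of possible values of theta
theta_0 = min(grid, key=lambda t: cv_error(X, y, t))       # theta_0 minimizes R_CV(theta)
w_final = fit_ridge(X, y, theta_0)                         # retrain on the full training sample

print("theta_0:", theta_0)
```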
So, I use cross-validation to choose a model. Then I use the same data to train the model. Insanity ensues.
Why? A model has two components: the hyperparameters and the parameters. The hyperparameters are inputs to the model, and the parameters are calculated from the hyperparameters and the training data. (This is very similar in spirit to what are called ‘hierarchical Bayesian models.’)
Instead of pulling a prior out of thin air for the hyperparameters, this asks “which hyperparameters produce models that generalize best to test sets outside the training set?”, which is a different question from “which parameters maximize the likelihood of this data?”
(I should add that some people call it ‘cross-tuning’ when you report a model whose hyperparameters have been selected by this sort of process and there is no third dataset, untouched during tuning, used for testing. Standard practice in ML is still to call it ‘cross-validation.’)
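To make the hyperparameter/parameter distinction concrete, here is a toy sketch of my own (a single hold-out split rather than full cross-validation, for brevity): the polynomial degree is the hyperparameter, the fitted coefficients are the parameters, and training error alone would always push you toward higher degree.

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0.0, 1.0, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

train = np.arange(x.size) % 2 == 0          # crude 50/50 split, purely for illustration
test = ~train

for degree in [1, 3, 9]:                    # the hyperparameter: an input to the fitting procedure
    coeffs = np.polyfit(x[train], y[train], deg=degree)   # the parameters: computed from degree + data
    train_mse = np.mean((np.polyval(coeffs, x[train]) - y[train]) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x[test]) - y[test]) ** 2)
    print(degree, float(train_mse), float(test_mse))
# Training error can only fall as the degree grows; held-out error need not.
```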
Besides, even cross-validation for model selection is suspicious. Shouldn’t I, ideally, train all models on all the data and form a posterior over the most probable values?
If you do this, how will you get an estimate of how well your model is able to predict outside of the training set?
But once the hyperparameter is in place, this is exactly what practitioners do: they fit the model on the full training data, so as to make the most use of everything.