The fear seems to be that people might propose theories with enough degrees of freedom that they can fine-tune them to fit the complete data very closely. But as long as the fitting process is repeatable, i.e., no numerology, cross-validation can be applied to discover which theories are genuinely predictive and which are over-fitting.
OK; so I’m looking at http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29 and the key nugget seems to be:
If we then take an independent sample of validation data from the same population as the training data, it will generally turn out that the model does not fit the validation data as well as it fits the training data. This is called overfitting, and is particularly likely to happen when the size of the training data set is small, or when the number of parameters in the model is large. Cross-validation is a way to predict the fit of a model to a hypothetical validation set when an explicit validation set is not available.
But how can we come up with a hypothetical validation set for the Universe?
Some of the lines seem to suggest exactly what those researchers propose, namely not allowing all of the observations to go into the theory but, like the Netflix Prize, holding back some data as a test (a toy sketch follows these excerpts):
One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set).
In K-fold cross-validation, the original sample is randomly partitioned into K subsamples. Of the K subsamples, a single subsample is retained as the validation data for testing the model, and the remaining K − 1 subsamples are used as training data.
As the name suggests, leave-one-out cross-validation (LOOCV) involves using a single observation from the original sample as the validation data, and the remaining observations as the training data.
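As a concrete illustration of the excerpts above (not from the thread or the article; the synthetic data, the polynomial "theories", and the fold count are illustrative assumptions), here is a minimal hand-rolled K-fold cross-validation sketch. A model with many degrees of freedom fits its training folds beautifully, but the held-out folds typically give it away:

```python
# Minimal K-fold cross-validation sketch: partition the sample, train on
# K-1 folds, validate on the held-out fold, and average the errors.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "observations": a smooth underlying law plus noise.
x = np.linspace(0.0, 1.0, 40)
y = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

def kfold_mse(x, y, degree, k=5):
    """Average validation MSE of a degree-`degree` polynomial over K folds."""
    folds = np.array_split(rng.permutation(x.size), k)
    errors = []
    for i in range(k):
        val = folds[i]                                          # held-out fold
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = np.polynomial.Polynomial.fit(x[train], y[train], degree)
        errors.append(np.mean((model(x[val]) - y[val]) ** 2))
    return float(np.mean(errors))

for degree in (3, 12):
    # Training error can only improve as parameters are added...
    full_fit = np.polynomial.Polynomial.fit(x, y, degree)
    train_mse = float(np.mean((full_fit(x) - y) ** 2))
    # ...but the cross-validated error is typically much worse for the
    # over-parameterised model, which is how over-fitting shows up.
    print(f"degree {degree:2d}: training MSE {train_mse:.3f}, "
          f"5-fold CV MSE {kfold_mse(x, y, degree):.3f}")
```

Shrinking the folds down to single observations gives the leave-one-out variant quoted above.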
Releasing the data in dribs and drabs doesn’t address this either.
There’s a difference between, on the one hand, having the data freely available and being intelligent enough to use cross-validation, and on the other, having someone paternalistically hold the data back from you.
It does force researchers into an ad hoc cross-validation scheme, doesn’t it?
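As a hypothetical illustration (not anything proposed in the thread; the data, models, and tranche sizes are made up), a staged release does amount to that kind of scheme: every theory is fit only to the tranches released so far, and each new tranche then scores it as a held-out test set.

```python
# Hypothetical sketch: data released in tranches; each new tranche acts as a
# holdout test for theories fit to the previously released observations only.
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, 60))
y = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

tranches = np.array_split(rng.permutation(x.size), 3)   # dribs and drabs

released = tranches[0]                       # what researchers can see today
for new_release in tranches[1:]:
    # Theories are frozen against the data available before the release...
    simple = np.polynomial.Polynomial.fit(x[released], y[released], 3)
    flexible = np.polynomial.Polynomial.fit(x[released], y[released], 12)
    # ...and the newly released observations score them as a holdout set.
    for name, model in (("degree-3", simple), ("degree-12", flexible)):
        mse = float(np.mean((model(x[new_release]) - y[new_release]) ** 2))
        print(f"new tranche vs {name} theory: holdout MSE {mse:.3f}")
    released = np.concatenate([released, new_release])
```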
If you start from the premise that researchers may fall into the overfitting trap, then you’re already treating them adversarially. And if just one researcher overfitting a theory and so becoming irrefutable will screw everything up, then the paranoid approach to data release prevents that total cockup (at the cost of some interim inefficiencies, by hindering the responsible, good researchers).
It doesn’t prevent that with complete reliability, either. How much time are you going to give researchers to come up with hypotheses before you release the full set? And what do you do if someone comes up with a new hypothesis after the full release, one so mind-blowingly elegant and simple that it blows all of the previously published ones out of the water?
If you think that some later hypotheses based on the full set might still be accepted, then you’re still vulnerable to falling into the overfitting trap after the full release. If you don’t, then you’ll be locked forever into the theories scientists came up with during the partial-release window, and no later advances in the scientific method, rationality, computing, math or even the intelligence of researchers will allow you to improve upon them.
This approach might get you some extra empirical evidence, but it will be empirical evidence about theories put together under quite limited conditions, compared to what will be available to later civilization.
I’d rather wait for researchers to screw up and then hammer them.