Two technical issues:
You say “As an upper bound we can use 200 gigabytes, the length of the human genome”. But the human genome actually consists of about 3 billion base pairs, each specifiable using two bits, so it’s about 0.75 gigabytes in size, even before taking into account that it’s somewhat compressible due to repeats and other redundancies.
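For reference, here is the back-of-the-envelope arithmetic behind that 0.75 gigabyte figure (assuming 2 bits per base and no compression):

```python
# Rough size of the human genome at 2 bits per base pair, no compression.
base_pairs = 3_000_000_000   # ~3 billion base pairs
bits_per_base = 2            # A, C, G, T each fit in 2 bits

total_bits = base_pairs * bits_per_base
gigabytes = total_bits / 8 / 1e9
print(f"{gigabytes:.2f} GB")  # ~0.75 GB
```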
You also say “The entropy (information content) of your training dataset must exceed the complexity of your model.” But actually it is typical for neural network models to have more parameters than there are numbers in the training data. Overfitting is avoided by methods such as “dropout” and “early stopping”. One could argue that these methods reduce the “effective” complexity of the model to less than the entropy of the training data, but if you do that, the statement verges on being tautological rather than substantive. For Bayesian learning methods, it is certainly not true that the complexity of the model must be limited to the entropy of the data set—at least in theory, a Bayesian model can be specified before you even know how much data you will have, and then doesn’t need to be modified based on how much data you actually end up with.
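To make the dropout and early-stopping point concrete, here is a minimal sketch in PyTorch (the layer sizes, learning rate, patience, and toy data are all hypothetical, not taken from the exchange above): an over-parameterized network whose effective capacity is limited by dropout during training and by stopping once validation loss stops improving.

```python
import torch
import torch.nn as nn

# A small over-parameterized network; dropout randomly zeroes activations
# during training, which limits how much the model can simply memorize.
model = nn.Sequential(
    nn.Linear(10, 256), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(256, 256), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(256, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Toy data with hypothetical shapes; in practice these come from your dataset.
x_train, y_train = torch.randn(512, 10), torch.randn(512, 1)
x_val, y_val = torch.randn(128, 10), torch.randn(128, 1)

best_val, patience, bad_epochs = float("inf"), 10, 0
for epoch in range(1000):
    model.train()
    optimizer.zero_grad()
    loss_fn(model(x_train), y_train).backward()
    optimizer.step()

    # Early stopping: halt when validation loss stops improving.
    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(x_val), y_val).item()
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```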
Thank you for correcting the size of the human genome. I have fixed the number.
My claim is indeed “that [early stopping] reduce[s] the ‘effective’ complexity of the model to less than the entropy of the training data”. I consider such big-data methods to fall into a separate, data-inefficient category from Bayesian learning methods. Thus, “[f]or Bayesian learning methods, it is certainly not true that the complexity of the model must be limited to the entropy of the data set”.
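As an illustration of the Bayesian point, here is a minimal sketch assuming a Beta-Bernoulli conjugate model (chosen purely for illustration): the prior is fixed before any data are seen, and the same update rule applies regardless of how many observations eventually arrive.

```python
# Beta-Bernoulli model: the prior (here Beta(1, 1)) is specified before any
# data are seen, and the identical update applies to any amount of data.
def posterior(prior_alpha, prior_beta, observations):
    heads = sum(observations)
    tails = len(observations) - heads
    return prior_alpha + heads, prior_beta + tails

# Same model, different amounts of data -- the model itself never changes.
print(posterior(1, 1, [1, 0, 1]))                 # (3, 2)
print(posterior(1, 1, [1, 0, 1, 1, 1, 0, 1, 0]))  # (6, 4)
```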