Oh yeah, thanks, I think I remember seeing that. Do you happen to know the assumptions that proof makes? I’m having a hard time finding it in Vapnik’s textbook. I vaguely remember it assuming that the correct hypothesis is in our hypothesis class, which made it seem kind of uninteresting.
To be clear, I agree it’s easier to overfit if you try lots of models, but my explanation of this would be more Bayesian than Vapnik’s. (Maybe something like: If a researcher restricts themselves to only 10 models, they will choose 10 models they feel relatively optimistic about/assign a high prior probability to. A high prior with a high likelihood gives us better generalization than a low prior with a slightly higher likelihood; the posterior probability of the first model is greater.)
1. The error on real data will be similar to the error on the training set + epsilon, where epsilon is roughly proportional to (datapoints / VC dimension.) This is the one I linked above.
2. The error on real data will be similar to the error of the best hypothesis in the hypothesis class, with similar proportionality
3. Special case of 2 – if the true hypothesis is in the hypothesis class, then the absolute error will be < epsilon (since the absolute error is just the difference from the true, best hypothesis.)
3 is probably the one you’re thinking of, but you don’t need the hypothesis to be in the class.
Oh yeah, thanks, I think I remember seeing that. Do you happen to know the assumptions that proof makes? I’m having a hard time finding it in Vapnik’s textbook. I vaguely remember it assuming that the correct hypothesis is in our hypothesis class, which made it seem kind of uninteresting.
To be clear, I agree it’s easier to overfit if you try lots of models, but my explanation of this would be more Bayesian than Vapnik’s. (Maybe something like: If a researcher restricts themselves to only 10 models, they will choose 10 models they feel relatively optimistic about/assign a high prior probability to. A high prior with a high likelihood gives us better generalization than a low prior with a slightly higher likelihood; the posterior probability of the first model is greater.)
A good exposition of the related theorems is in Chapter 6 of Understanding Machine Learning (https://www.amazon.com/Understanding-Machine-Learning-Theory-Algorithms/dp/1107057132/ref=sr_1_1?crid=2MXVW7VOQH6FT&keywords=understanding+machine+learning+from+theory+to+algorithms&qid=1562085244&s=gateway&sprefix=understanding+machine+%2Caps%2C196&sr=8-1)
There are several related theorems. Roughly:
1. The error on real data will be similar to the error on the training set + epsilon, where epsilon is roughly proportional to (datapoints / VC dimension.) This is the one I linked above.
2. The error on real data will be similar to the error of the best hypothesis in the hypothesis class, with similar proportionality
3. Special case of 2 – if the true hypothesis is in the hypothesis class, then the absolute error will be < epsilon (since the absolute error is just the difference from the true, best hypothesis.)
3 is probably the one you’re thinking of, but you don’t need the hypothesis to be in the class.