Is there a theoretical justification for using VC dimension as a basis for generalization, or is it treated as an axiomatic desideratum?
Yes, roughly speaking, if you multiply the VC dimension by n, then you need n times as much training data to achieve the same performance. (More precise statement here: https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_dimension#Uses) There are also a few other bounds you can get based on VC dimension. In practice these bounds are way too large to be useful, but an algorithm with much higher VC dimension will generally overfit more.
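A sketch of why that scaling holds (standard PAC sample-complexity form, constants and log factors omitted, so treat it as a heuristic rather than the exact theorem): to get error within epsilon of optimal with probability 1 − δ, it roughly suffices to have

```latex
m \;=\; O\!\left(\frac{d + \ln(1/\delta)}{\varepsilon^{2}}\right)
```

where d is the VC dimension and m the number of training examples, so multiplying d by n multiplies the required m by n at fixed epsilon and δ.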
Oh yeah, thanks, I think I remember seeing that. Do you happen to know the assumptions that proof makes? I’m having a hard time finding it in Vapnik’s textbook. I vaguely remember it assuming that the correct hypothesis is in our hypothesis class, which made it seem kind of uninteresting.
To be clear, I agree it’s easier to overfit if you try lots of models, but my explanation of this would be more Bayesian than Vapnik’s. (Maybe something like: if a researcher restricts themselves to only 10 models, they will choose 10 models they feel relatively optimistic about, i.e. assign a high prior probability to. A high prior with a high likelihood gives us better generalization than a low prior with a slightly higher likelihood; the posterior probability of the first model is greater.)
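A toy numerical version of that comparison (the numbers are made up purely for illustration, not from any real experiment):

```python
# Toy illustration with hypothetical numbers: a model with a high prior and a
# high likelihood beats a model with a low prior and a slightly higher
# likelihood, because the posterior is proportional to prior * likelihood.

def posterior_odds(prior_a, lik_a, prior_b, lik_b):
    """Ratio of unnormalized posteriors P(A | data) / P(B | data)."""
    return (prior_a * lik_a) / (prior_b * lik_b)

# Model A: one of 10 hand-picked models (high prior), fits the data well.
# Model B: one of 10,000 models tried (low prior), fits slightly better.
odds = posterior_odds(prior_a=0.1, lik_a=0.8, prior_b=0.0001, lik_b=0.9)
print(odds)  # model A comes out far more probable a posteriori
```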
A good exposition of the related theorems is in Chapter 6 of Understanding Machine Learning (https://www.amazon.com/Understanding-Machine-Learning-Theory-Algorithms/dp/1107057132)
There are several related theorems. Roughly:
1. The error on real data will be within epsilon of the error on the training set, where epsilon is roughly proportional to sqrt(VC dimension / number of datapoints). This is the one I linked above.
2. The error on real data will be within epsilon of the error of the best hypothesis in the hypothesis class, with a similar proportionality for epsilon.
3. Special case of 2 – if the true hypothesis is in the hypothesis class, then the absolute error will be < epsilon (since the absolute error is just the difference from the true, best hypothesis).
3 is probably the one you’re thinking of, but for bounds 1 and 2 you don’t need the true hypothesis to be in the class.
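For reference, bound 1 is usually stated roughly like this (a sketch with constants omitted; here L_D is the true error, L_S the training error, d the VC dimension, and m the sample size): with probability at least 1 − δ over the draw of the training set,

```latex
L_{\mathcal{D}}(h) \;\le\; L_{S}(h)
  \;+\; C\,\sqrt{\frac{d\,\ln(m/d) + \ln(1/\delta)}{m}}
  \qquad \text{for all } h \in \mathcal{H}
```

The sqrt(d/m) dependence is exactly what makes the "n times the VC dimension needs roughly n times the data" heuristic above come out right.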