A lecturer suggested that it's good to do leave-one-out cross-validation if you have ~20 data points, that 20-200 is leave-out-10% territory, and that with > 200 it's fine to just use 30% for a test set. He didn't justify this, though, and it doesn't seem that important to stick to.
I would beware the opinions of individual people on this, as I don't believe it's a very settled question. For instance, my favorite textbook author, Prof. Frank Harrell, thinks 22k is "just barely large enough to do split-sample validation." The adequacy of leave-one-out versus 10-fold depends on your available computational power as well as your sample size. And 200 certainly doesn't seem like enough to hold out 30% as a test set; the resulting estimate has far too much variance.
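To put a rough number on that variance claim, here's a small numpy simulation. The figures are illustrative assumptions (a fixed classifier with true accuracy 0.80, evaluated on a 30% holdout of 60 points from n = 200), but the point is general: the standard deviation of a holdout accuracy estimate is sqrt(p(1-p)/n_test).

```python
import numpy as np

rng = np.random.default_rng(0)

true_acc = 0.80        # assumed true accuracy of some fixed classifier
n_test = 60            # a 30% holdout from n = 200
n_trials = 100_000     # repeated random splits

# Each holdout estimate is the mean of n_test Bernoulli(true_acc) outcomes.
estimates = rng.binomial(n_test, true_acc, size=n_trials) / n_test

theoretical_sd = np.sqrt(true_acc * (1 - true_acc) / n_test)
print(f"empirical SD of the estimate:  {estimates.std():.3f}")
print(f"theoretical SD sqrt(p(1-p)/n): {theoretical_sd:.3f}")
```

With only 60 test points the SD is about 0.05, i.e. the measured accuracy routinely swings by five points or more from split to split, which can easily swamp the real differences between models.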
That’s interesting, and a useful update.
Thinking about this more, the LOO/k-fold/split-sample choice should depend heavily on how much signal versus noise you expect. In the case you link to, they're looking at behavioural health, which is far from deterministic: events like heart attacks occur in under 5% of the population being studied. And the question-asker is trying to tease out potentially quite subtle differences between the performance of SVM, logistic regression, et cetera.
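The rare-event point is easy to make concrete with a quick simulation. Assuming (illustratively) 200 patients and a 5% event rate, a random 30% holdout contains very few events at all:

```python
import numpy as np

rng = np.random.default_rng(1)

n, prevalence = 200, 0.05   # assumed: 200 patients, 5% event rate
n_trials = 100_000

# Total events in each simulated cohort of n patients.
events_total = rng.binomial(n, prevalence, size=n_trials)
# Events that land in a random 60-patient (30%) holdout: hypergeometric draw.
events_in_holdout = rng.hypergeometric(events_total, n - events_total, 60)

print("mean events in a 60-point holdout:", events_in_holdout.mean())
print("share of splits with <= 1 event:  ", (events_in_holdout <= 1).mean())
```

The holdout averages only about 3 events, and a substantial fraction of random splits contain one event or none, so any metric that depends on the positive class is nearly meaningless on a split-sample basis at this size.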
It also depends on the number of features in the model, their distributions, the distribution of the target variable, and so on.
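The computational side of the LOO-versus-k-fold tradeoff can also be sketched directly. The following numpy-only example uses a toy nearest-centroid classifier on synthetic two-class data (both the classifier and the data are illustrative assumptions, not anyone's actual setup): at n = 200, LOO requires 200 model fits while 10-fold requires 10, and the two estimates typically land close together.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy two-class data: 200 points, two Gaussian blobs (illustrative).
n = 200
X = np.vstack([rng.normal(0.0, 1.0, (n // 2, 2)),
               rng.normal(1.5, 1.0, (n // 2, 2))])
y = np.repeat([0, 1], n // 2)

def fit_predict(X_tr, y_tr, X_te):
    """Nearest-centroid classifier: predict the class whose mean is closer."""
    c0 = X_tr[y_tr == 0].mean(axis=0)
    c1 = X_tr[y_tr == 1].mean(axis=0)
    d0 = np.linalg.norm(X_te - c0, axis=1)
    d1 = np.linalg.norm(X_te - c1, axis=1)
    return (d1 < d0).astype(int)

def cv_accuracy(k):
    """k-fold cross-validated accuracy; k = n gives leave-one-out."""
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)
    correct = 0
    for te in folds:                      # one model fit per fold
        tr = np.setdiff1d(idx, te)
        correct += (fit_predict(X[tr], y[tr], X[te]) == y[te]).sum()
    return correct / n

acc_loo = cv_accuracy(n)    # 200 model fits
acc_10 = cv_accuracy(10)    # 10 model fits
print(f"LOO accuracy:     {acc_loo:.3f}")
print(f"10-fold accuracy: {acc_10:.3f}")
```

For a cheap model like this the 20x cost difference is irrelevant, but for anything expensive to fit it's usually the deciding factor, which is why 10-fold (possibly repeated) is such a common default.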