Did you hold back some data for out-of-sample testing? By which criteria do you evaluate your model?

Thanks for your interest :-). You’ve preempted one of my next posts, where I write:
Because of the small size of the data set, there’s an unusually great need to guard against overfitting. For this reason, I did cross-validation at the level of individual events: the way I generated predictions for an event E was to fit a model using all other events as a training set. In this setup, a feature that superficially appears to increase predictive power on the basis of one or two events will degrade overall performance, and that degradation is reflected in the error rate associated with the predictions. The standard I set for including a feature is that it not only improve predictive power when we average over all events, but that it also improve performance for a majority of the events in the sample (at least 5 out of 9).
[...]
Our measure of error is log loss, a metric of the overall accuracy of the probability estimates. A probability estimate of p for a yes decision incurs a penalty of -log(p) when the actual decision is yes, and a penalty of -log(1 - p) when the decision is no, so the penalty is small for confident correct predictions and large for confident wrong ones. If one knows that an event occurs with frequency f and has no other information, log loss is minimized by assigning the event probability f; if one requires that the measure of error obey certain coherence properties, log loss is the unique measure with this property. Lower log loss generally corresponds to fewer false positives and fewer false negatives, and it is the best choice of error measure for our ultimate goal of predicting matches and making recommendations based on these predictions.
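To make the quoted procedure concrete, here’s a minimal sketch of event-level cross-validation combined with the log-loss scoring and the feature-inclusion rule described above. The data layout, the column names, and the use of logistic regression are illustrative assumptions on my part rather than a description of the actual model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def log_loss(y_true, p_pred, eps=1e-15):
    # Penalty of -log(p) for an actual yes, -log(1 - p) for an actual no.
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def leave_one_event_out(df, feature_cols, event_col="event", target_col="decision"):
    # For each event E, fit on all other events and score the predictions for E.
    losses = {}
    for event in df[event_col].unique():
        train, test = df[df[event_col] != event], df[df[event_col] == event]
        model = LogisticRegression().fit(train[feature_cols], train[target_col])
        p = model.predict_proba(test[feature_cols])[:, 1]
        losses[event] = log_loss(test[target_col].to_numpy(), p)
    return losses

def keep_feature(df, base_features, candidate, min_events_improved=5):
    # Inclusion rule: the candidate feature must lower the average log loss AND
    # lower it for a majority of events (at least 5 of the 9 in the sample).
    before = leave_one_event_out(df, base_features)
    after = leave_one_event_out(df, base_features + [candidate])
    improved = sum(after[e] < before[e] for e in before)
    return (np.mean(list(after.values())) < np.mean(list(before.values()))
            and improved >= min_events_improved)
```

Here `df` is assumed to be a pandas DataFrame with one row per prediction, an `event` column, a binary `decision` column, and one column per feature.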
A lecturer suggested that it’s good to do leave-one-out cross-validation if you have 20-ish data points, that 20-200 is leave-out-10% territory, and that with > 200 it’s fine to just use 30% as a test set. He didn’t justify this, though, and it doesn’t seem that important to stick to.
I would beware the opinions of individual people on this, as I don’t believe it’s a very settled question. For instance, my favorite textbook author, Prof. Frank Harrell, thinks 22k is “just barely large enough to do split-sample validation.” The adequacy of leave-one-out versus 10-fold depends on your available computational power as well as your sample size. 200 certainly doesn’t seem like enough to hold out 30% as a test set; there’s way too much variance.
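To put a number on “way too much variance”: with n = 200 and a 30% holdout you’re scoring the model on only 60 points. Here’s a minimal simulation sketch (the 70% true accuracy is a made-up figure, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Suppose the classifier's true accuracy is 0.70 (hypothetical).  With a
# 60-point test set, the measured accuracy is Binomial(60, 0.7) / 60.
true_accuracy, test_size = 0.70, 60
estimates = rng.binomial(test_size, true_accuracy, size=100_000) / test_size

print(f"std of the holdout estimate: {estimates.std():.3f}")   # roughly 0.06
print(f"middle 95%: {np.percentile(estimates, 2.5):.2f}"
      f" to {np.percentile(estimates, 97.5):.2f}")             # roughly 0.58 to 0.82
```

So two models whose true accuracies differ by a few percentage points can easily swap places on a single split of that size.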
That’s interesting, and a useful update.
On thinking about this more, I suppose the LOO/k-fold/split-sample question should depend a lot on factors relating to how much signal and noise you expect. In the case you link to, they’re looking at behavioural health, which is far from deterministic: events like heart attacks occur in less than 5% of the population being studied. And the question-asker is trying to tease out differences between the performance of SVM, logistic regression, et cetera that may be quite subtle.
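A hedged sketch of what I mean, on a synthetic dataset with a roughly 5% positive rate (none of this is the data from the linked question; the models and numbers are purely for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for a rare-outcome setting: roughly 5% positives.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95],
                           random_state=0)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)
for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("SVM", SVC(probability=True))]:
    scores = -cross_val_score(model, X, y, cv=cv, scoring="neg_log_loss")
    print(f"{name}: log loss {scores.mean():.3f} (fold-to-fold std {scores.std():.3f})")
```

If the fold-to-fold spread is comparable to the gap between the two mean scores, a single train/test split won’t reliably tell you which model is better.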
It also depends on the number of features in the model, their distributions, the distribution of the target variable, etc.
Excellent :-) I’ll postpone my comments till that post, then.