A lecturer suggested that it's good to do leave-one-out cross-validation if you have ~20 data points, that 20-200 is leave-out-10% territory, and that with > 200 it's fine to just use 30% for a test set. He didn't justify this, though, and it doesn't seem that important to stick to.
I would beware the opinions of individual people on this, as I don't believe it's a very settled question. For instance, my favorite textbook author, Prof. Frank Harrell, thinks 22k is "just barely large enough to do split-sample validation." The adequacy of leave-one-out versus 10-fold depends on your available computational power as well as your sample size. And 200 certainly doesn't seem like enough to hold out 30% as a test set; the resulting estimate has far too much variance.
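To put a rough number on that variance claim, here's a small numpy simulation. The figures are illustrative assumptions (a fixed classifier with true accuracy 0.80, evaluated on a 30% holdout of 60 points from n = 200), but the point is general: the standard deviation of a holdout accuracy estimate is sqrt(p(1-p)/n_test).

```python
import numpy as np

rng = np.random.default_rng(0)

true_acc = 0.80        # assumed true accuracy of some fixed classifier
n_test = 60            # a 30% holdout from n = 200
n_trials = 100_000     # repeated random splits

# Each holdout estimate is the mean of n_test Bernoulli(true_acc) outcomes.
estimates = rng.binomial(n_test, true_acc, size=n_trials) / n_test

theoretical_sd = np.sqrt(true_acc * (1 - true_acc) / n_test)
print(f"empirical SD of the estimate:  {estimates.std():.3f}")
print(f"theoretical SD sqrt(p(1-p)/n): {theoretical_sd:.3f}")
```

With only 60 test points the SD is about 0.05, i.e. the measured accuracy routinely swings by five points or more from split to split, which can easily swamp the real differences between models.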
That’s interesting, and a useful update.
Thinking about this more, the LOO/k-fold/split-sample choice should depend heavily on how much signal versus noise you expect. In the case you link to, they're looking at behavioural health, which is far from deterministic: events like heart attacks occur in under 5% of the population being studied. And the question-asker is trying to tease out potentially quite subtle differences between the performance of SVM, logistic regression, et cetera.
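The rare-event point is easy to make concrete with a quick simulation. Assuming (illustratively) 200 patients and a 5% event rate, a random 30% holdout contains very few events at all:

```python
import numpy as np

rng = np.random.default_rng(1)

n, prevalence = 200, 0.05   # assumed: 200 patients, 5% event rate
n_trials = 100_000

# Total events in each simulated cohort of n patients.
events_total = rng.binomial(n, prevalence, size=n_trials)
# Events that land in a random 60-patient (30%) holdout: hypergeometric draw.
events_in_holdout = rng.hypergeometric(events_total, n - events_total, 60)

print("mean events in a 60-point holdout:", events_in_holdout.mean())
print("share of splits with <= 1 event:  ", (events_in_holdout <= 1).mean())
```

The holdout averages only about 3 events, and a substantial fraction of random splits contain one event or none, so any metric that depends on the positive class is nearly meaningless on a split-sample basis at this size.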
It also depends on the number of features in the model, their distributions, the distribution of the target variable, and so on.
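The computational side of the LOO-versus-k-fold tradeoff can also be sketched directly. The following numpy-only example uses a toy nearest-centroid classifier on synthetic two-class data (both the classifier and the data are illustrative assumptions, not anyone's actual setup): at n = 200, LOO requires 200 model fits while 10-fold requires 10, and the two estimates typically land close together.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy two-class data: 200 points, two Gaussian blobs (illustrative).
n = 200
X = np.vstack([rng.normal(0.0, 1.0, (n // 2, 2)),
               rng.normal(1.5, 1.0, (n // 2, 2))])
y = np.repeat([0, 1], n // 2)

def fit_predict(X_tr, y_tr, X_te):
    """Nearest-centroid classifier: predict the class whose mean is closer."""
    c0 = X_tr[y_tr == 0].mean(axis=0)
    c1 = X_tr[y_tr == 1].mean(axis=0)
    d0 = np.linalg.norm(X_te - c0, axis=1)
    d1 = np.linalg.norm(X_te - c1, axis=1)
    return (d1 < d0).astype(int)

def cv_accuracy(k):
    """k-fold cross-validated accuracy; k = n gives leave-one-out."""
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)
    correct = 0
    for te in folds:                      # one model fit per fold
        tr = np.setdiff1d(idx, te)
        correct += (fit_predict(X[tr], y[tr], X[te]) == y[te]).sum()
    return correct / n

acc_loo = cv_accuracy(n)    # 200 model fits
acc_10 = cv_accuracy(10)    # 10 model fits
print(f"LOO accuracy:     {acc_loo:.3f}")
print(f"10-fold accuracy: {acc_10:.3f}")
```

For a cheap model like this the 20x cost difference is irrelevant, but for anything expensive to fit it's usually the deciding factor, which is why 10-fold (possibly repeated) is such a common default.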