You’re right, I should have written “but it turns out most of them could be beaten by the untuned versions of several competitors on the five datasets”, as one can see in the figures. Thank you for pointing it out; I’ll edit the post.
You’re inadvertently p-hacking, I think. Some of them are beaten on specific tasks by the untuned versions of several competitors, but those same algorithms come out ahead on other tasks.
Consider the following simple model:
Each algorithm has a normal distribution of scenario-specific performances with mean=<some algorithm-dependent parameter> and stddev=sqrt(2)/2.
Each scenario has a normal distribution of run-specific performances with mean=<the scenario-specific performance for this algorithm and scenario> and stddev=sqrt(2)/2.
I have algorithm X, with mean=0. Overall run performance is roughly [1] a normal distribution with mean=0 and stddev=1.
I improve this into algorithm Y, with mean=1. Overall run performance is roughly [1] a normal distribution with mean=1 and stddev=1.
On any given scenario, algorithm X will have a higher scenario-specific mean than algorithm Y ~16%[2] of the time!
I don’t know offhand if this is exact; it appears close at the very least.
>>> import random
>>> count = 10**6
>>> sum(random.normalvariate(mu=0, sigma=2**0.5/2) > random.normalvariate(mu=1, sigma=2**0.5/2) for _ in range(count)) / count
0.158625
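For what it’s worth, here is a minimal end-to-end sketch of the same two-level model (scenario-specific means drawn around each algorithm’s mean, individual runs drawn around each scenario-specific mean, both with stddev sqrt(2)/2). It is only meant to illustrate the setup above; the function and variable names and the 10,000-scenario sample size are arbitrary choices of mine, not anything from the post or the paper.

import random
import statistics

SCENARIO_STD = 2 ** 0.5 / 2   # stddev of scenario-specific means around the algorithm mean
RUN_STD = 2 ** 0.5 / 2        # stddev of individual runs around the scenario-specific mean
N_SCENARIOS = 10_000          # arbitrary sample size for this sketch

def simulate(algorithm_mean):
    """Return (scenario_means, run_results) for one algorithm under the two-level model."""
    scenario_means = [random.normalvariate(algorithm_mean, SCENARIO_STD) for _ in range(N_SCENARIOS)]
    run_results = [random.normalvariate(m, RUN_STD) for m in scenario_means]
    return scenario_means, run_results

x_means, x_runs = simulate(0.0)   # algorithm X, mean 0
y_means, y_runs = simulate(1.0)   # algorithm Y, mean 1

# Overall run performance should come out roughly N(algorithm mean, 1).
print("X runs: mean ~", round(statistics.mean(x_runs), 3), "stddev ~", round(statistics.stdev(x_runs), 3))
print("Y runs: mean ~", round(statistics.mean(y_runs), 3), "stddev ~", round(statistics.stdev(y_runs), 3))

# Fraction of scenarios where X's scenario-specific mean beats Y's (expected ~16%).
beats = sum(x > y for x, y in zip(x_means, y_means)) / N_SCENARIOS
print("X beats Y at the scenario level in", round(100 * beats, 1), "% of scenarios")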
Perhaps the most important takeaway from our study is hidden in plain sight: the field is in danger of being drowned by noise. Different optimizers exhibit a surprisingly similar performance distribution compared to a single method that is re-tuned or simply re-run with different random seeds. It is thus questionable how much insight the development of new methods yields, at least if they are conceptually and functionally close to the existing population.

This is from the authors’ conclusion. They do also acknowledge that a couple of optimizers seem to be better than others across tasks and datasets, and I agree with them (and with you, if that’s your point). But most optimizers do not meet the “significant improvement” claims their authors have been making. The authors also say that most tuned algorithms can be equaled by trying several un-tuned algorithms. So the point is twofold:
1. Most new algorithms can be equaled or beaten by re-tuning of most old algorithms.
2. Their tuned versions can be equaled or beaten by many un-tuned versions of old algorithms.
This seems consistent with there being no overwhelming winner and with low variance in algorithm performance.
If I understand your model correctly (let me know if I do): if an algorithm Y improves performance by 1 std on a specific task, it would still get beaten by an unimproved algorithm ~16% of the time. Sure, but you have to compute the probability of the Y algorithm (mean=1, std=1) being beaten by the X1, X2, X3, X4 algorithms (all mean=0, std=1), which is what is happening in the authors’ experiment, and that probability is much lower.
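To put a rough number on that last point, here is a minimal simulation sketch, reusing the two-level model above and comparing scenario-specific means (stddev sqrt(2)/2), i.e. the same comparison behind the ~16% figure. The choice of four untuned competitors and the sample size are illustrative assumptions of mine, not anything from the paper.

import random

SCENARIO_STD = 2 ** 0.5 / 2   # stddev of scenario-specific means, as in the model above
N_COMPETITORS = 4             # hypothetical number of untuned competitors X1..X4
N_TRIALS = 10**5              # arbitrary number of simulated scenarios

beaten_by_one = 0   # scenarios where a single competitor beats Y's scenario-specific mean
beaten_by_all = 0   # scenarios where every competitor beats Y's scenario-specific mean
for _ in range(N_TRIALS):
    y = random.normalvariate(1, SCENARIO_STD)                                    # improved algorithm Y (mean 1)
    xs = [random.normalvariate(0, SCENARIO_STD) for _ in range(N_COMPETITORS)]   # untuned competitors (mean 0)
    beaten_by_one += xs[0] > y
    beaten_by_all += all(x > y for x in xs)

print("beaten by one competitor:", beaten_by_one / N_TRIALS)                     # ~0.16, as in the parent comment
print("beaten by all", N_COMPETITORS, "competitors:", beaten_by_all / N_TRIALS)  # much smaller

Because all the comparisons share Y’s draw, the “beaten by all” probability comes out higher than the naive 0.16^4 you would get from independence, but it is still far below the single-competitor 16%.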