If the performance of any given model on any given task were dominated by noise, you should expect a negative correlation, not zero, because of reversion-to-the-mean effects.
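(To see this concretely, here is a minimal synthetic sketch, not the actual benchmark data: if per-task scores at three successive model sizes were pure iid noise, the correlation between successive improvements comes out around −0.5 rather than 0.)

```python
import numpy as np

rng = np.random.default_rng(0)
n_tasks = 100_000

# Per-task scores at three model sizes, pure iid noise (synthetic).
s1, s2, s3 = rng.normal(size=(3, n_tasks))

d_small = s2 - s1  # "improvement" from size 1 to size 2
d_large = s3 - s2  # "improvement" from size 2 to size 3

# Successive differences of iid variables correlate at -1/2.
print(np.corrcoef(d_small, d_large)[0, 1])  # ≈ -0.5, not 0
```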
Which you do see when comparing the two smaller models' improvements to each other, no?
I don't expect reversion to the mean to be clearly dominant w.r.t. 7.1B→280B, because the effect is much smaller than the capability jump there. It's also worth remembering that these smaller models' outputs can be arbitrary without being random; I wouldn't expect one 1000M-parameter model's outputs to be fully decorrelated from another 1000M-parameter model's outputs even if performance were pure chance.
The PaLM NLU/BigBench numbers do seem to be positively correlated, in contrast, especially when using logits or looking at error rate, which is the more reasonable way to measure them given their nonzero baseline performance.
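(For concreteness, a hypothetical sketch of the logit version, with placeholder accuracies standing in for the real PaLM per-task numbers: convert accuracies to log-odds before correlating the improvements, so that tasks near floor or ceiling aren't artificially compressed.)

```python
import numpy as np

def log_odds(p, eps=1e-6):
    """Convert an accuracy in (0, 1) to log-odds."""
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))

# Placeholder per-task accuracies for the three PaLM sizes (NOT the real numbers).
acc_8b   = np.array([0.30, 0.45, 0.52, 0.61, 0.70])
acc_62b  = np.array([0.38, 0.50, 0.60, 0.66, 0.78])
acc_540b = np.array([0.55, 0.63, 0.74, 0.80, 0.90])

# Improvements measured in log-odds rather than raw accuracy.
d_small = log_odds(acc_62b) - log_odds(acc_8b)
d_large = log_odds(acc_540b) - log_odds(acc_62b)

print(np.corrcoef(d_small, d_large)[0, 1])
```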
E: Did I say something dumb? I cannot figure out why I’ve been singularly downvoted here. AFAICT the things I am saying are primarily factual.