I re-ran the Gopher MMLU and BIG-bench data as logits rather than raw percentages; the correlation is still zero:
https://i.imgur.com/mSeJoZM.png
(Logit performances for the 400M model and the 7B model were highly significantly different: p = 6*10^-7 in a single-factor ANOVA.)
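(For reference, that test is just a one-way ANOVA over the per-task logit scores. A minimal sketch with made-up placeholder data, not the real per-task numbers:)

import numpy as np
from scipy.stats import f_oneway

# Hypothetical per-task logit scores for two model sizes; the real
# inputs would be the Gopher per-task accuracies, logit-transformed.
rng = np.random.default_rng(0)
logits_400m = rng.normal(-1.0, 0.5, size=57)
logits_7b = rng.normal(-0.4, 0.5, size=57)

f_stat, p_value = f_oneway(logits_400m, logits_7b)  # single-factor ANOVA
print(f"F = {f_stat:.1f}, p = {p_value:.1e}")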
In the case of MMLU, because random performance is 25% rather than 0%, I tried subtracting 14% (the lowest score of any model on any task) before running the logit, to try to reduce noise from floor effects; the correlation was still zero. The highest score of any model on any task was 96%, few were above 90%, and averages were in the 25%-75% range, so I don’t think ceiling effects are currently significant here.
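(Concretely, the transform is just this; my own sketch, and one could also rescale by (1 - floor) rather than only subtracting:)

import numpy as np

def floored_logit(score, floor=0.14):
    # Subtract an estimated floor before taking the logit, to reduce
    # noise from chance-level performance; floor = lowest observed score.
    p = score - floor
    return np.log(p / (1.0 - p))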
If the performance of any given model on any given task were super noisy, you should expect negative correlation, not zero, because of reversion-to-mean effects. E.g., here is a simulation I ran with n = 60 simulated tasks, with different values for the ratio between “how much variance is there in task scalability?” and “how much noise is there for the performance of a given model on a given task?”:
https://i.imgur.com/1I71IO0.png
If there is a lot of noise, the correlation is negative; it’s pretty unusual to get exactly zero. (Code: https://gist.github.com/rationalism/b8925017700605b339b8f8439283d670)
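The core of the simulation is roughly this (a simplified sketch; the gist above has the actual code):

import numpy as np

rng = np.random.default_rng(0)
n_tasks = 60

def improvement_correlation(scalability_sd, noise_sd):
    # Each task has a true "scalability" slope; each model's score on
    # each task is slope * model_size + independent per-model noise.
    slope = rng.normal(0, scalability_sd, n_tasks)
    small = 1 * slope + rng.normal(0, noise_sd, n_tasks)
    mid = 2 * slope + rng.normal(0, noise_sd, n_tasks)
    large = 3 * slope + rng.normal(0, noise_sd, n_tasks)
    # Correlate the small->mid improvement with the mid->large one.
    return np.corrcoef(mid - small, large - mid)[0, 1]

for ratio in [4.0, 1.0, 0.25]:
    # High scalability/noise ratio -> positive correlation;
    # noise-dominated -> negative (reversion to the mean).
    print(ratio, improvement_correlation(ratio, 1.0))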
The way questions are chunked is pretty non-arbitrary, in that questions within a task are much more similar to each other than random questions are. E.g., here are two questions from one random BIG-bench task and two from a second task:
"input": "Each time you play your guitar, you are playing an instrument.",
"target_scores": { "causal": 0, "correlative": 0, "neutral": 1 }

"input": "Looking into a bright light makes my eyes water.",
"target_scores": { "causal": 1, "correlative": 0, "neutral": 0 }
Q: (1 + 1 + 1 + 1) =
A: 4
Q: ((2 * 2) + (3 * 1)) =
A: 7
Which you see when comparing the two smaller models’ improvements to each other, no?
I don’t expect reversion to the mean to be clearly dominant w.r.t. 7.1B→280B, because that effect is much smaller than the capability jump there. It’s also worth remembering that these smaller models’ outputs can be arbitrary without being random; I wouldn’t expect one 1000M-parameter model’s outputs to be fully decorrelated from another 1000M-parameter model’s outputs even if performance were pure chance.
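As a toy illustration of that last point (made-up numbers): two models sharing an arbitrary bias over answer choices agree with each other above chance even while both score exactly at chance:

import numpy as np

rng = np.random.default_rng(0)
n_questions, n_choices = 10_000, 4

# Both "models" share an arbitrary preference over the four choices,
# while the correct answers are uniform: accuracy stays at chance,
# but the two models' outputs are correlated with each other.
bias = np.array([0.4, 0.3, 0.2, 0.1])
correct = rng.integers(0, n_choices, n_questions)
model_a = rng.choice(n_choices, n_questions, p=bias)
model_b = rng.choice(n_choices, n_questions, p=bias)

print("accuracy A:", (model_a == correct).mean())     # ~0.25 (chance)
print("agreement A/B:", (model_a == model_b).mean())  # ~0.30 (> chance)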
In contrast, the PaLM NLU/BIG-bench numbers do seem to be positively correlated, especially when using logits or looking at error rates, which is the more reasonable treatment given their nonzero performance.
Edit: Did I say something dumb? I cannot figure out why I’ve been singularly downvoted here. AFAICT the things I am saying are primarily factual.