Yeah, I don’t find a linear regression on pairs of models to be all that informative:
the parameterization as % is misleading, squashing differences (see the short sketch at the end of this comment)
especially as you would expect, for 2 reasons, performance to spend most of its time near 0 or 1: near 1, because the reason we are so excited about DL is that it is solving so many tasks, and once solved they stay solved; and near 0 because, with so many tasks now approaching 1, we need to create ever more super-duper hard, now usually adversarially constructed, tasks on which all the models start off around 0
it also would tend to exaggerate or erase plateaus and phase transitions depending on where the model sizes fall within the transition and what the base error rate is, neither of which has any principled connection to the phenomenon of interest (it does not matter whether the baseline is 10% error because a task is multiple-choice with 10 options rather than 25% because it had 4 options).
individual tasks have a lot of random sampling error: ie. if we constructed a task a second time with fresh data, we would see the same models get different scores
individual models have ‘sampling error’: each model is a sample from the Bayesian posterior and will make somewhat different predictions; this will lead to different scores on the same task (ie. if we trained the same model in exactly the same way except for the random seed & other nondeterminism like GPU ops/network, it would get different scores on the same task)
comparing a single pair of models is not very powerful:
You don’t have ‘n = 62’, you have ‘n = 1’. (Imagine if you broke down each task into its individual single-questions instead of the fairly arbitrary existing task chunks. Do you now suddenly have n=100,000 or whatever? No, of course not.)
range restriction in model scaling: these are power/log phenomena; a pair of models differing by a single order of magnitude is not informative.
Plotting the predictive loss over many models and multiple orders of magnitude is meaningful. Plotting it versus normalized performance across many tasks is also reasonable, albeit highly noisy. Plotting individual tasks of single checkpoints against somewhat larger checkpoints is a recipe for baking in so much noise at so many levels, attenuating everything towards zero through measurement error, that I'm not too surprised one doesn't see any clear patterns in the residuals and may be chasing noise.
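To make the squashing concrete, here is a minimal sketch (illustrative numbers only, not from any benchmark): the same +1 gain in logit space shows up as a large raw-% jump mid-range and as almost nothing near the ceiling.

    import numpy as np

    # The same +1 gain in logit space, applied to tasks at different baselines:
    # mid-range tasks move a lot in raw %, near-ceiling tasks barely move.
    def logit(p):
        return np.log(p / (1 - p))

    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    before = np.array([0.50, 0.90, 0.98])
    after = sigmoid(logit(before) + 1.0)
    print(np.round(after - before, 3))  # ~[0.231, 0.061, 0.013]

So whether a given pair of checkpoints looks like a plateau or a jump in raw % depends heavily on where the baseline happens to sit.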
I re-ran the Gopher MMLU and BIG-bench data as logits rather than raw percentages, and the correlation is still zero:
https://i.imgur.com/mSeJoZM.png
(Logit performances for the 400M model and the 7B model were highly significantly different, p = 6*10^-7 in a single-factor ANOVA.)
In the case of MMLU, because random performance is 25% rather than 0%, I tried subtracting 14% (the lowest score of any model on any task) before running the logit, to reduce noise from floor effects; the correlation was still zero. The highest score of any model on any task was 96%, few were above 90%, and averages were in the 25%-75% range, so I don't think ceiling effects are significant here.
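Concretely, the transform looks something like this (a minimal sketch with hypothetical per-task accuracies, not the real Gopher numbers; rescaling by the remaining range after subtracting the floor is one reasonable reading of “subtracting 14%”):

    import numpy as np

    FLOOR = 0.14  # lowest score of any model on any task

    def floored_logit(acc, floor=FLOOR):
        # Subtract the floor, rescale into (0, 1), then take the logit.
        q = (acc - floor) / (1 - floor)
        return np.log(q / (1 - q))

    # Hypothetical per-task accuracies for three model sizes:
    small = np.array([0.27, 0.31, 0.25, 0.40, 0.33])
    mid   = np.array([0.35, 0.29, 0.48, 0.52, 0.30])
    large = np.array([0.55, 0.33, 0.61, 0.70, 0.41])

    # Correlate the per-task logit improvements of the two size jumps:
    jump1 = floored_logit(mid) - floored_logit(small)
    jump2 = floored_logit(large) - floored_logit(mid)
    print(np.corrcoef(jump1, jump2)[0, 1])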
If the performance of any given model on any given task were super noisy, you should expect negative correlation, not zero, because of reversion-to-mean effects. Eg., here is a simulation I ran with n = 60 simulated tasks, with different values for the ratio between “how much variance is there in task scalability?” and “how much noise is there for the performance of a given model on a given task?”:
https://i.imgur.com/1I71IO0.png
If there is a lot of noise, the correlation is negative; it’s pretty unusual to get exactly zero. (Code: https://gist.github.com/rationalism/b8925017700605b339b8f8439283d670)
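The gist has the actual code; a minimal self-contained sketch of the same idea, with purely illustrative parameter values, looks like:

    import numpy as np

    rng = np.random.default_rng(0)

    def mean_improvement_corr(n_tasks=60, scalability_sd=1.0, noise_sd=1.0, n_sims=2000):
        # Correlate per-task improvements (small->mid) vs. (mid->large) when each
        # task has a true per-step gain plus independent per-model, per-task noise.
        corrs = []
        for _ in range(n_sims):
            gain = rng.normal(0, scalability_sd, n_tasks)   # how well each task scales
            base = rng.normal(0, 1, n_tasks)                # per-task baseline
            small = base + rng.normal(0, noise_sd, n_tasks)
            mid = base + gain + rng.normal(0, noise_sd, n_tasks)
            large = base + 2 * gain + rng.normal(0, noise_sd, n_tasks)
            corrs.append(np.corrcoef(mid - small, large - mid)[0, 1])
        return np.mean(corrs)

    # Noise-dominated: the shared noise on the middle model drags the correlation
    # negative (reversion to the mean). Scalability-dominated: clearly positive.
    print(mean_improvement_corr(scalability_sd=0.2, noise_sd=1.0))
    print(mean_improvement_corr(scalability_sd=1.0, noise_sd=0.2))

In this toy model the expected correlation crosses zero only when the two variances are about equal, which is why landing exactly at zero is the unusual case.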
The way questions are chunked is pretty non-arbitrary, in that questions within a task are much more similar to each other than to random questions? Eg., here are two questions from one random BIG-bench task and two from a second task:
"input": "Each time you play your guitar, you are playing an instrument.",
"target_scores": { "causal": 0, "correlative": 0, "neutral": 1 }
"input": "Looking into a bright light makes my eyes water.",
"target_scores": { "causal": 1, "correlative": 0, "neutral": 0 }
Q: (1 + 1 + 1 + 1) =
A: 4
Q: ((2 * 2) + (3 * 1)) =
A: 7
"If the performance of any given model on any given task were super noisy, you should expect negative correlation, not zero, because of reversion-to-mean effects."

Which you see comparing the two smaller models' improvements to each other, no?
I don't expect reversion to the mean to be clearly dominant wrt. 7.1B→280B, because the effect is much smaller than the capability jump there. It's also worth remembering that these smaller models' outputs can be arbitrary but not necessarily random; I wouldn't expect one 1000M-parameter model's outputs to be fully decorrelated from another 1000M-parameter model's outputs even if performance were pure chance.
The PaLM NLU/BIG-bench numbers, in contrast, do seem to be positively correlated, especially when using logits or looking at error rates, which is the more reasonable treatment of them given their nonzero performance.
E: Did I say something dumb? I cannot figure out why I’ve been singularly downvoted here. AFAICT the things I am saying are primarily factual.