Individual MMLU tasks are extremely noisy. They’re so noisy that the paper specifically recommends not drawing conclusions from performance on individual tasks, and instead looking at four high-level topical categories. The individual tasks also vary enormously in their variance: some are pretty easy for a college-educated adult, while others have genuine experts scoring less than 80%.
This is compounded by the fact that the sample sizes vary wildly: many of the tasks have around 100 questions, while at the other extreme there is a task with 1534 questions. The aggregated topics, however, have the same number of questions per topic, because the benchmark was explicitly designed for analysis along those lines.
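A rough sketch of how much pure sampling noise those sample sizes imply, treating each task as a binomial over independent questions (the 100 and 1534 question counts are from above; the 50% true accuracy is just an illustrative worst case, not a measured number):

```python
import math

# Standard error of measured accuracy on an n-question task,
# assuming independent questions and true accuracy p (binomial model).
def accuracy_se(p: float, n: int) -> float:
    return math.sqrt(p * (1 - p) / n)

for n in (100, 1534):
    se = accuracy_se(0.5, n)  # p = 0.5 maximizes the variance, so this is a worst case
    print(f"{n:5d} questions: ~{100 * se:.1f} percentage points of standard error")
# 100 questions -> ~5.0 points; 1534 questions -> ~1.3 points.
# So a several-point "improvement" on a small task can easily be chance.
```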
I don’t know the extent to which these issues plague the other evaluations, but I think more care needs to be taken before drawing conclusions with highly noisy data.
See my response to Gwern: https://www.lesswrong.com/posts/G993PFTwqqdQv4eTg/is-ai-progress-impossible-to-predict?commentId=MhnGnBvJjgJ5vi5Mb
In particular, extremely noisy data does not explain the results here, unless I’ve totally missed something. If the data is super noisy, the correlation should be negative, not zero, due to regression-to-mean effects (as indeed we saw for the smallest Gopher models, which are presumably so tiny that performance is essentially random).
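For illustration, here is a minimal simulation of that pure-noise case, assuming each task’s measured score is a fixed true value plus independent sampling noise at every model size (all the specific numbers are made up, chosen only to roughly match MMLU-sized tasks):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tasks = 57           # roughly the number of MMLU tasks
noise_sd = 0.05        # ~100-question tasks => ~5 pp of sampling noise
true_score = rng.uniform(0.3, 0.7, n_tasks)  # no real improvement at all

# Measured scores for three consecutive model sizes: truth + fresh noise each time.
scores = true_score + rng.normal(0, noise_sd, (3, n_tasks))

first_jump = scores[1] - scores[0]   # "improvement" from model 1 to model 2
second_jump = scores[2] - scores[1]  # "improvement" from model 2 to model 3

print(np.corrcoef(first_jump, second_jump)[0, 1])
# Pure noise gives a correlation around -0.5: a task that got a lucky draw on
# model 2 shows a big first jump and then falls back, because the same noise
# term enters the two consecutive differences with opposite signs.
```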
Doesn’t that mean that you are getting some predictiveness by looking at momentum? If progress on a task was totally unpredictable, with no signal and all noise, then your way of carving up the data would produce negative correlations. Instead you’re mostly finding correlations near zero, or slightly positive, which means that there is just about enough signal to counteract that noise.
The signal-to-noise ratio is going to depend on a lot of contingent factors. There will be more noise if there are fewer questions on a task. There will be less signal from one model version to the next if there is a smaller increase in model size, or if the task is one where improvement happens very gradually as models scale up (though in those cases you could find a clearer signal by looking across several model versions, rather than just two consecutive jumps).
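To illustrate that trade-off, here is a variant of the same kind of simulation where each task also gets a persistent true improvement per scale-up; the `signal_sd` values are arbitrary, chosen only to span the range from noise-dominated to signal-dominated:

```python
import numpy as np

rng = np.random.default_rng(1)
n_tasks, noise_sd = 57, 0.05

def jump_correlation(signal_sd: float) -> float:
    """Correlation between two consecutive per-task improvements when each
    task has its own persistent true improvement per scale-up (signal_sd)
    on top of independent measurement noise (noise_sd)."""
    true_gain = rng.normal(0, signal_sd, n_tasks)     # same true gain at both jumps
    base = rng.uniform(0.3, 0.7, n_tasks)
    truth = np.stack([base, base + true_gain, base + 2 * true_gain])
    scores = truth + rng.normal(0, noise_sd, (3, n_tasks))
    d1, d2 = scores[1] - scores[0], scores[2] - scores[1]
    return np.corrcoef(d1, d2)[0, 1]

for signal_sd in (0.0, 0.02, 0.05, 0.10):
    print(signal_sd, round(jump_correlation(signal_sd), 2))
# Roughly -0.5 with no signal, near zero when signal and noise are comparable,
# and clearly positive once per-task improvements dominate the sampling noise.
```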
We should expect regression towards the mean only if the tasks were selected for having high “improvement from small to Gopher-7”. Were they?