it is likely—nay, probable—that several of the other current ‘inverse scaling’ examples are actually U-shaped and simply haven’t been tested with models like Flan-U-PaLM or GPT-4 or further future models that solve them
“Inverse scaling can become U-shaped” has been updated (v3), showing PaLM has U-shaped scaling on 11 previously-inverse-scaling tasks taken from here, and if I’m reading it right, there’s only 1 inverse-scaling task which PaLM doesn’t U-shape on:
Limitations: Note the broad emergence of U-shaped scaling across these tasks does not mean that the Inverse Scaling Benchmark is solved. This is because although PaLM 540B* increases performance compared to PaLM 62B, it often still does not do much better than random performance, as is the case for five of the nine U-shaped scaling tasks with accuracy as the evaluation metric. Hence, there is an opportunity for further research to find a way for models to perform better than random on these tasks. Additionally, the Redefine Math task is inverse scaling for all model families tested.
So, one task is still a holdout, and the U-shaped scaling hasn’t yet brought performance up to a desirable level, but overall, I regard this as resolving inverse scaling: it is not particularly important other than as a cautionary lesson in extrapolation & hidden scaling, and ‘scale is (still) all you need’.
(Also notable: inner monologue results.)
* Wei confirms that this is not Flan or U-PaLM, just the plain original PaLM. So it’s possible that the Flan or U-PaLM variants would U-curve ‘Redefine Math’ or improve the overall scaling substantially.
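To make the extrapolation/hidden-scaling point concrete, here is a minimal sketch (purely illustrative, with made-up accuracy numbers rather than anything taken from the Prize or the U-shaped-scaling paper) of how the same task can look like clean inverse scaling when only a few smaller models are tested, and then U-shape once a much larger model is included:

```python
# Minimal sketch (hypothetical data, not from the Inverse Scaling Prize or the
# U-shaped-scaling paper): classify a task's scaling trend from a handful of
# (model scale, accuracy) points.

def classify_scaling(points):
    """points: list of (model_scale, accuracy) tuples, sorted by model_scale."""
    accs = [acc for _, acc in points]
    if all(a <= b for a, b in zip(accs, accs[1:])):
        return "positive scaling"
    if all(a >= b for a, b in zip(accs, accs[1:])):
        return "inverse scaling (so far -- larger models might still turn it around)"
    trough = min(range(len(accs)), key=lambda i: accs[i])  # index of the worst accuracy
    if 0 < trough < len(accs) - 1 and accs[-1] > accs[trough]:
        return "U-shaped scaling"
    return "mixed / unclear"

# Hypothetical task: looks like clean inverse scaling if only the three smaller
# models are tested, but U-shapes once a much larger model is added.
curve = [(8e9, 0.55), (62e9, 0.45), (175e9, 0.38), (540e9, 0.62)]
print(classify_scaling(curve[:3]))  # inverse scaling (so far ...)
print(classify_scaling(curve))      # U-shaped scaling
```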
GPT-4 (discussion) has been released and performs much better than PaLM/U-PaLM, and as predicted, there is also U-shaped scaling with GPT-4 where GPT-3/GPT-3.5 showed inverse scaling:
Some capabilities are still hard to predict. For example, the Inverse Scaling Prize was a competition to find a metric that gets worse as model compute increases, and “hindsight neglect” was one of the winners. Just like with another recent result, GPT-4 reverses the trend:
[Inverse Scaling Prize, hindsight neglect: GPT-4 goes to ~100%]
(Paper doesn’t seem to provide any additional information on inverse-scaling.)
It is not clear whether this happened on its own or whether they deliberately trained the model not to make such mistakes.
Perhaps, in similar future contests, it would be worth keeping half of the discovered tasks secret in order to test future models on them.