GPT-4 (discussion) has been released and performs much better than PaLM/U-PaLM, and, as predicted, the scaling turns out to be U-shaped once GPT-4 is included, rather than inverse as it looked with GPT-3/GPT-3.5:
Some capabilities are still hard to predict. For example, the Inverse Scaling Prize was a contest to find tasks on which performance gets worse as model compute increases, and “hindsight neglect” was one of the winners. As with another recent result, GPT-4 reverses the trend (a sketch of the task format follows the figure):
[Inverse Scaling Prize, hindsight neglect: GPT-4 goes to ~100%]
(The paper doesn’t seem to provide any additional information on inverse scaling.)
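To make the task concrete, here is a minimal sketch of a hindsight-neglect item, assuming the standard format described in the Inverse Scaling Prize write-ups; the prompt wording and numbers are illustrative, not taken from the actual dataset. The ground-truth answer follows the expected value of the bet, while the tempting “hindsight” answer follows the realized outcome, which is exactly the mistake that grew with scale in GPT-3-era models:

```python
# Minimal sketch of a hindsight-neglect item (illustrative numbers,
# not copied from the Inverse Scaling Prize dataset).
# The correct label follows the expected value of the bet;
# the "hindsight" trap answer follows the realized outcome.

def expected_value(p_win: float, win_amount: float, lose_amount: float) -> float:
    """Expected value of a simple two-outcome bet (lose_amount is negative)."""
    return p_win * win_amount + (1 - p_win) * lose_amount

def correct_label(p_win: float, win_amount: float, lose_amount: float) -> str:
    """Ground truth: taking the bet was the right decision iff EV > 0."""
    return "Yes" if expected_value(p_win, win_amount, lose_amount) > 0 else "No"

def hindsight_label(won: bool) -> str:
    """The trap answer: judging the decision by how it happened to turn out."""
    return "Yes" if won else "No"

# A bet with strongly negative expected value that happened to win:
p_win, win_amount, lose_amount = 0.09, 5.0, -900.0
prompt = (
    f"Michael takes a bet with a {p_win:.0%} chance to win ${win_amount:.0f} "
    f"and a {1 - p_win:.0%} chance to lose ${-lose_amount:.0f}. "
    "Michael wins the bet. Was taking the bet the right decision?"
)

print(prompt)
print("EV:", expected_value(p_win, win_amount, lose_amount))            # -818.55
print("Correct answer:", correct_label(p_win, win_amount, lose_amount)) # "No"
print("Hindsight answer:", hindsight_label(won=True))                   # "Yes"
```

On items like this, inverse scaling meant that larger models increasingly picked the hindsight answer; the figure above shows GPT-4 answering by expected value nearly every time.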
It is not clear whether this happened on its own or whether the model was deliberately trained not to make such mistakes.
Perhaps in similar future studies it would be worth keeping half of the discovered tasks secret, so that future models can be tested on them.