Really interesting, thanks for sharing!
I find it super surprising that the tasks worked up until Gopher, but stopped working at PaLM. That’s such a narrow gap! That alone suggests an interesting meta-level point re: inverse scaling being rare, and that the prize mostly picked up on the adverse selection of “the tasks that were inverse-y enough to not have issues on the models used”.
One prediction this hypothesis makes is that people were overfitting to “what can GPT-3 not do”, and thus that there are a bunch of submitted tasks that became U-shaped by Gopher, and the winning ones were just the ones that turned U-shaped a bit beyond Gopher?
I’m also very curious how well these work on Chinchilla.
See this disclaimer on how they’ve modified our tasks (they’re finding U-shaped trends on a couple of tasks that are different from the ones we found inverse scaling on, and they made some modifications that make the tasks easier).
Oh that’s sketchy af lol. Thanks!