The authors have updated their arXiv paper based on my feedback, and I’m happy with the evaluation setup now: https://arxiv.org/abs/2211.02011v2. They’re showing that scaling PaLM gives U-shaped scaling on 2/4 tasks (rather than 3/4 in the earlier version) and inverse scaling on 2/4 tasks. I personally found this result at least somewhat surprising, given the fairly consistent inverse scaling we found across the various model series we tried. They’re also finding that inverse scaling on these tasks goes away with chain-of-thought prompting, which I think is a neat finding (and nice to see some success from visible-thoughts-style methods here). After this paper, I’m pretty interested to know:
what PaLM scaling laws look like for Round 2 inverse scaling tasks
if inverse scaling continues on the other 2 Round 1 tasks
if there are tasks where even chain-of-thought leads to inverse scaling
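For concreteness, here is a rough sketch of the prompting difference at issue. The task wording and the zero-shot "Let's think step by step" trigger are illustrative assumptions on my part, not the paper's exact prompts (which, as I understand it, use few-shot worked examples); it is only meant to show what a standard prompt vs. a chain-of-thought prompt looks like on a Round 1-style task.

```python
# Illustrative sketch only: the question wording and the zero-shot
# "Let's think step by step" trigger are assumptions for illustration,
# not the paper's actual prompts.

# A question loosely in the spirit of a Round 1 inverse scaling task
# (hindsight-neglect flavour).
question = (
    "A bet pays $100 with 90% probability and loses $5 otherwise. "
    "Alice takes the bet and happens to lose $5. Was taking the bet "
    "the right decision? Answer Yes or No."
)

# Standard prompt: the model answers directly. This is the setting in
# which inverse scaling was observed on some tasks.
standard_prompt = question

# Chain-of-thought prompt: elicit intermediate reasoning before the
# final answer. The updated paper reports that inverse scaling on these
# tasks goes away under this kind of prompting.
cot_prompt = question + "\nLet's think step by step."

print(standard_prompt)
print()
print(cot_prompt)
```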
They’re also finding that inverse scaling on these tasks goes away with chain-of-thought prompting
So, like some of the BIG-Bench PaLM results, these are more cases of ‘hidden scaling’ where quite simple inner-monologue approaches can show smooth scaling while the naive pre-existing benchmark claims that there are no gains with scale?
Yup