I’m a research scientist at Anthropic doing empirical safety research on language models. In the past, I’ve worked on automated red teaming of language models [1], the Inverse Scaling Prize [2], learning from human feedback [3][4], and empirically testing debate [5][6], iterated amplification [7], and other methods [8] for scalably supervising AI systems as they become more capable.
Website: https://ethanperez.net/
The authors have updated their arXiv paper based on my feedback, and I’m happy with the evaluation setup now: https://arxiv.org/abs/2211.02011v2. They’re showing that scaling PaLM gives U-shaped scaling on 2/4 tasks (rather than 3/4 in the earlier version) and inverse scaling on 2/4 tasks. I personally found this result at least somewhat surprising, given the fairly consistent inverse scaling we found across the various model series we tried. They’re also finding that inverse scaling on these tasks goes away with chain-of-thought prompting, which I think is a neat finding (and it’s nice to see some success from visible-thoughts-style methods here). After this paper, I’m pretty interested to know:
what PaLM scaling laws look like for Round 2 inverse scaling tasks
if inverse scaling continues on the other 2 Round 1 tasks
if there are tasks where even chain-of-thought prompting leads to inverse scaling
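
For concreteness, here's a minimal sketch (not from the paper; the function, the tolerance, and the accuracy numbers are all made up for illustration) of what I mean by labeling a task's trend as inverse vs. U-shaped scaling, given accuracies ordered from the smallest to the largest model in a series:

```python
from typing import List


def classify_scaling(accuracies: List[float], tol: float = 0.01) -> str:
    """Label the scaling trend of accuracies ordered from smallest to largest model."""
    small, large = accuracies[0], accuracies[-1]
    lowest = min(accuracies)
    if large < small - tol:
        return "inverse"    # performance degrades with scale
    if lowest < min(small, large) - tol:
        return "U-shaped"   # dips at intermediate scale, recovers at the largest
    if abs(large - small) <= tol:
        return "flat"
    return "standard"       # performance improves with scale


if __name__ == "__main__":
    # Hypothetical accuracies for four model sizes (smallest -> largest).
    print(classify_scaling([0.70, 0.55, 0.45, 0.40]))  # inverse
    print(classify_scaling([0.70, 0.50, 0.55, 0.80]))  # U-shaped
```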