I was trying to say that the gap between the two did not decrease with scale. Of course, raw performance increases with scale as gwern & others would be happy to see :)
Yes, that was my takeaway. You expect a gap, but there is no particular reason to expect the gap to close with scale, because that would require critique to scale better than discrimination; why would you expect that, rather than the two scaling similarly (maintaining the gap) or diverging in the other direction (discrimination scaling better than critique)?
I think the gap itself is mildly interesting in a “it knows more than it can say” deception sort of way, but we already knew similar things from stuff like prompt programming for buggy Codex code completions. Since the knowledge must be there in the model, and it is brought out by fairly modest scaling (a larger model can explain what a smaller model detects), I would guess that it wouldn’t be too hard to improve the critiques with the standard tricks: generating a lot of completions & scoring for the best one (which they show does help a lot), and better prompting (inner-monologue seems like an obvious trick to apply to get it to fisk the summary: “let’s explain step by step why this is wrong, starting with the key quote: ”); a sketch of the best-of-n trick is below. The gap will only be interesting if it proves immune to the whole arsenal. If it isn’t, then it’s just another case of “sampling can prove the presence of knowledge but not the absence”.
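For concreteness, a minimal sketch of that best-of-n loop in Python; `sample_critique` and `score_critique` are hypothetical stand-ins for whatever model calls you’d actually use (none of this is the paper’s setup), with dummy bodies so the control flow runs:

```python
# Best-of-n critique sampling: draw several critiques, keep the top-scoring one.
# sample_critique / score_critique are placeholders, NOT any real API.
import random

COT_PREFIX = "Let's explain step by step why this is wrong, starting with the key quote: "

def sample_critique(summary: str) -> str:
    # Placeholder: in practice, sample one critique from the model at
    # temperature > 0, prompting with something like COT_PREFIX + summary.
    return f"{COT_PREFIX}... (dummy critique #{random.randrange(10**6)})"

def score_critique(summary: str, critique: str) -> float:
    # Placeholder: in practice, a helpfulness score from the model itself
    # (its discriminator ability) or a learned reward model.
    return random.random()

def best_of_n_critique(summary: str, n: int = 8) -> str:
    # Draw n independent critiques and keep the highest-scoring one: this
    # exploits the model knowing more as a discriminator than it reliably
    # says in any single greedy sample.
    candidates = [sample_critique(summary) for _ in range(n)]
    return max(candidates, key=lambda c: score_critique(summary, c))

if __name__ == "__main__":
    print(best_of_n_critique("The article says X, but the summary claims Y."))
```

The point of the sketch is just that closing the critique-discrimination gap this way needs nothing beyond sampling and the model’s own scoring, which is why the gap is only interesting if it survives such tricks.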
Otherwise, this looks like a lot of results any pro-scaling advocate would be unsurprised to see: yet another task with apparently smooth improvement with model size*; some capabilities emerging in larger but not smaller models (“We also find that large models are able to directly improve their outputs, using their self-critiques, which small models are unable to do. Using better critiques helps models make better improvements than they do with worse critiques, or with no critiques.”), at sizes that can’t be predicted in advance and so require empirical testing; big performance boosts from better sampling procedures than naive greedy sampling; interesting nascent bootstrapping effects...
* did I miss something, or does the paper omit any mention of parameter counts entirely and talk only in terms of model loss?