There are some kinds of ‘oversight failure’ which aren’t ‘scalable oversight failure’, e.g. the ball-grabbing robot hand example. I don’t think the problem there was oversight simply failing to scale to superhuman.
Huh—I would have called it a “scalable oversight failure”, but am now persuaded that that’s a bad term to use, and “camouflage failure” works better.
Anthropic’s post on AI safety seems to use “scalable oversight” in the way I expect, to include the problem of the human not having enough info to provide feedback.
It may be that humans won’t be able to provide accurate/informed enough feedback to adequately train models to avoid harmful behavior across a wide range of circumstances. It may be that humans can be fooled by the AI system, and won’t be able to provide feedback that reflects what they actually want (e.g. accidentally providing positive feedback for misleading advice). It may be that the issue is a combination, and humans could provide correct feedback with enough effort, but can’t do so at scale. This is the problem of scalable oversight, and it seems likely to be a central issue in training safe, aligned AI systems. [bolding by DanielFilan]
I think my suggested usage is slightly better, but I’m not sure it’s worth the effort of trying to make people change, though I find ‘camouflage’ a useful term when I’m trying to explain this to people.