Good question. I think there's a large overlap between them, including most of the important/scary cases that don't involve deceptive alignment (which are usually both). I think listing examples is the easiest way of explaining where they come apart:

- There are some kinds of 'oversight failure' which aren't 'scalable oversight failure', e.g. the ball-grabbing robot hand thing. I don't think the problem here was oversight simply failing to scale to superhuman. This does count as camouflage.
- There are also some kinds of scalable oversight failure where the issue looks more like 'we didn't try at all' than 'we tried, but selecting based only on what we could see screwed us'. Someone deciding to deploy a system and essentially just hoping that it's aligned would fall into this camp, but a more realistic case would be something like only evaluating a system based on its immediate effects, and then the long-run effects being terrible. You might not consider this a 'failure of scalable oversight', and instead want to call it a 'failure to even try scalable oversight', but I think the line is blurry: maybe people tried some scalable oversight stuff, it didn't really work, and then they gave up and said 'short term is probably fine'.
- I think most failures of scalable oversight have some story which roughly goes "people tried to select for things that would be good, and instead got things that looked like they would be good to the overseer" (see the toy sketch below). These count as both.
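To make that last bullet a bit more concrete, here is a minimal toy sketch (my own illustration, not anything from the actual experiments discussed): each candidate behaviour has a true value plus a 'camouflage' term, the overseer only sees the sum, and selecting hard on what the overseer sees systematically over-selects the camouflage.

```python
# Toy sketch of "selecting for things that would be good, and instead getting
# things that look good to the overseer". Purely illustrative assumptions:
# true value and camouflage are independent random draws, and the overseer
# scores only their sum ("appearance").
import random

random.seed(0)

def sample_candidate():
    true_value = random.gauss(0, 1)          # how good the behaviour actually is
    camouflage = max(0.0, random.gauss(0, 1))  # how much better it looks than it is
    appearance = true_value + camouflage     # all the overseer can see
    return true_value, appearance

candidates = [sample_candidate() for _ in range(10_000)]

# The overseer keeps the candidates that look best.
selected = sorted(candidates, key=lambda c: c[1], reverse=True)[:100]

avg_true_all = sum(c[0] for c in candidates) / len(candidates)
avg_true_selected = sum(c[0] for c in selected) / len(selected)
avg_gap_selected = sum(c[1] - c[0] for c in selected) / len(selected)

print(f"average true value, all candidates:     {avg_true_all:+.2f}")
print(f"average true value, overseer's top 100: {avg_true_selected:+.2f}")
print(f"average looks-minus-is gap, top 100:    {avg_gap_selected:+.2f}")
```

In this toy model, selecting on appearance still raises true value somewhat, because appearance correlates with it; but the harder you select, the more the gap between 'looks good to the overseer' and 'is good' dominates the selected population.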
> There are some kinds of 'oversight failure' which aren't 'scalable oversight failure', e.g. the ball-grabbing robot hand thing. I don't think the problem here was oversight simply failing to scale to superhuman.
Huh—I would have called it a “scalable oversight failure”, but am now persuaded that that’s a bad term to use, and “camouflage failure” works better.
Anthropic’s post on AI safety seems to use “scalable oversight” in the way I expect, to include the problem of the human not having enough info to provide feedback.
> It may be that humans won't be able to provide accurate/informed enough feedback to adequately train models to avoid harmful behavior across a wide range of circumstances. It may be that humans can be fooled by the AI system, and won't be able to provide feedback that reflects what they actually want (e.g. accidentally providing positive feedback for misleading advice). It may be that the issue is a combination, and humans could provide correct feedback with enough effort, but can't do so at scale. This is the problem of scalable oversight, and it seems likely to be a central issue in training safe, aligned AI systems. [bolding by DanielFilan]
I think my suggested usage is slightly better, but I'm not sure it's worth the effort of trying to make people change, though I find 'camouflage' a useful term when I'm trying to explain this to people.