Anthropic’s post on AI safety seems to use “scalable oversight” in the way I expect, to include the problem of the human not having enough info to provide feedback.
> It may be that humans won’t be able to provide accurate/informed enough feedback to adequately train models to avoid harmful behavior across a wide range of circumstances. It may be that humans can be fooled by the AI system, and won’t be able to provide feedback that reflects what they actually want (e.g. accidentally providing positive feedback for misleading advice). It may be that the issue is a combination, and humans could provide correct feedback with enough effort, but can’t do so at scale. This is the problem of scalable oversight, and it seems likely to be a central issue in training safe, aligned AI systems. [bolding by DanielFilan]
I think my suggested usage is slightly better, but I’m not sure it’s worth the effort of trying to make people change, though I find ‘camouflage’ a useful term when I’m explaining things to people.