Re: independent audits, although they’re not possible for this particular problem, there are many close variants of the problem for which independent audits are possible. Think of human approval as a distorted view of our actual preferences; our goal is to avoid things which are really bad according to those undistorted actual preferences. If we pass artificially distorted human approval to our AI system, and the system still avoids things which are really bad according to the undistorted approval, that suggests the system is robust to distortion.
For example:
Input your preferences extremely quickly, then see if the result is acceptable when you’re given more time to think about it.
Input your preferences while drunk, then see if the result is acceptable to your sober self.
Tell your friend they can only communicate using gestures. Have a 5-minute “conversation” with them, then go off and input their preferences as you understand them. See if they find the result acceptable.
Distort the inputs in code. This lets you test a very wide range of distortion models and see which ones produce acceptable performance (a rough sketch of this kind of test follows below).
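A minimal sketch of the in-code distortion idea, under assumptions not in the original comment: "actual preferences" are represented as a utility vector over a finite set of outcomes, "approval" is that vector plus Gaussian noise (just one of many possible distortion models), and the agent simply picks the outcome with the highest approval. The names `distort` and `catastrophe_rate` are hypothetical; the quantity of interest is how often the agent's choice is "really bad" (bottom few percent) under the undistorted utilities as the noise scale grows.

```python
# Toy harness (illustrative only): how often does an agent optimizing
# distorted approval pick an outcome that is really bad under the
# undistorted preferences?
import numpy as np

rng = np.random.default_rng(0)

def distort(utilities, noise_scale):
    """One toy distortion model: additive Gaussian noise on each utility."""
    return utilities + rng.normal(scale=noise_scale, size=utilities.shape)

def catastrophe_rate(n_trials=10_000, n_outcomes=50, noise_scale=0.5,
                     really_bad_quantile=0.05):
    bad_count = 0
    for _ in range(n_trials):
        actual = rng.normal(size=n_outcomes)      # undistorted preferences
        approval = distort(actual, noise_scale)   # what the agent actually sees
        chosen = int(np.argmax(approval))         # agent optimizes approval
        threshold = np.quantile(actual, really_bad_quantile)
        if actual[chosen] <= threshold:           # "really bad" per actual prefs
            bad_count += 1
    return bad_count / n_trials

if __name__ == "__main__":
    for scale in (0.1, 0.5, 1.0, 2.0):
        rate = catastrophe_rate(noise_scale=scale)
        print(f"noise_scale={scale}: catastrophe rate ~ {rate:.4f}")
```

If this toy setup is at all representative, the same harness can be reused for other distortion models (systematic bias, dropped preference dimensions, adversarial perturbations) by swapping out `distort`, which is the point of doing the distortion in code rather than by hand.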
It would be helpful if people could outline some plausible-seeming scenarios for how divergence between approval and actual preferences could cause a catastrophe, in order to get a better sense for the appropriate noise model.
One scenario that comes to mind: an agent generates a manipulative output that is optimized both to be approved by the programmers and to let the agent seize control over more resources (in a way that goes against the programmers’ actual preferences).