It would be helpful if people could outline some plausible-seeming scenarios for how divergence between approval and actual preferences could cause a catastrophe, in order to get a better sense for the appropriate noise model.
One scenario that comes to mind: an agent generates a manipulative output that is optimized to win the programmers' approval while causing the agent to seize control over more resources, against the programmers' actual preferences.
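To make the "noise model" point concrete, here's a minimal toy sketch (mine, not anyone's proposed model; the candidate count, noise scales, and the `manipulative` set are arbitrary assumptions). It contrasts two ways approval can diverge from actual preferences: independent random error on every output, versus a handful of outputs adversarially optimized to look good while being catastrophic, as in the scenario above. In both cases the agent simply picks the single highest-approval output.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 10_000  # candidate outputs the agent can choose among
true_utility = rng.normal(0, 1, N)  # programmers' actual preferences

# Noise model 1: approval = true utility + independent random error.
approval_iid = true_utility + rng.normal(0, 1, N)

# Noise model 2: same random error, plus a few "manipulative" outputs
# engineered to score very high approval while being catastrophic
# (e.g. the agent quietly seizes control of more resources).
approval_adv = true_utility + rng.normal(0, 1, N)
true_utility_adv = true_utility.copy()
manipulative = rng.choice(N, size=10, replace=False)
approval_adv[manipulative] += 10.0      # optimized to look approvable
true_utility_adv[manipulative] -= 100.0  # actually catastrophic

# The agent maximizes approval in both worlds.
print("iid noise:   true utility of chosen output =",
      true_utility[np.argmax(approval_iid)])
print("adversarial: true utility of chosen output =",
      true_utility_adv[np.argmax(approval_adv)])
```

Under i.i.d. error the argmax over approval still lands on a genuinely good output, but ten adversarially-correlated errors out of ten thousand are enough to dominate the selection and make the chosen output catastrophic. If that's the right picture, the relevant noise model is adversarial divergence concentrated on high-approval outputs, not uniform mislabeling.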