Actually, my confusion is mostly on the source of the adverse selection pressure.
The only source of adverse selection pressure I understand is from multi-agent interactions in multipolar environments, but on my model, agent foundations researchers do not expect multipolar outcomes by default / as the modal case, so I’m not really understanding the mechanism for adverse selection.
I listed some hypotheses for where the adverse selection could come from, but I don’t have a compelling intuitive story for why they’d produce such adverse selection.
The John Wentworth argument that you are responding to is:
Goodhart’s Law means that proxies which might at first glance seem approximately-fine will break down when lots of optimization pressure is applied. And when we’re talking about aligning powerful future AI, we’re talking about a lot of optimization pressure. That’s the key idea which generalizes to other alignment strategies: crappy proxies won’t cut it when we start to apply a lot of optimization pressure.
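For concreteness, here is a minimal toy sketch (my own illustration, not something from the thread) of the dynamic the quote describes: the proxy is the true value plus a heavier-tailed error term, so under light selection the proxy looks approximately fine, but as the candidate pool grows, the proxy-optimal candidate is increasingly selected for its error rather than for what we actually care about.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Goodhart setup (assumed for illustration): proxy = true value + heavy-tailed error.
for pool_size in (10, 1_000, 1_000_000):
    true_value = rng.normal(size=pool_size)          # what we actually care about
    error = rng.laplace(scale=0.5, size=pool_size)   # error term with heavier tails than the signal
    proxy = true_value + error                       # what the optimizer actually sees

    picked = np.argmax(proxy)                        # more optimization pressure = larger pool to pick the proxy-best from
    print(f"pool={pool_size:>9,}: true value of proxy-chosen candidate = "
          f"{true_value[picked]:+.2f}, best available = {true_value.max():+.2f}")
```

In this toy setup, under light selection the proxy-chosen candidate's true value sits close to the best available, but as the pool grows it plateaus while the best available keeps climbing: the proxy that seemed approximately fine stops tracking the true objective once a lot of optimization pressure is applied.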
Could you give an example of a desirable safety property where you are unsure what angle the adversarial pressure would come from?
Corrigibility is the first one that comes to mind.
What’s a proxy of corrigibility that you think might at first glance seem approximately-fine?
Obedience/deference seem like the obvious proxies for corrigibility.
A proxy is supposed to be observable so that it can actually serve its intended purpose.
What use do you have for a measure of corrigibility, and how do you intend to observe obedience/deference for that use?