The John Wentworth argument that you are responding to is:
Goodhart’s Law means that proxies which might at first glance seem approximately-fine will break down when lots of optimization pressure is applied. And when we’re talking about aligning powerful future AI, we’re talking about a lot of optimization pressure. That’s the key idea which generalizes to other alignment strategies: crappy proxies won’t cut it when we start to apply a lot of optimization pressure.
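A toy sketch of the dynamic Wentworth is pointing at (not part of his argument; the objective, the proxy, and all numbers below are invented for illustration): a hill-climber that optimizes a proxy measuring only part of what the true objective cares about keeps improving its proxy score, while the true objective first improves and then collapses as the optimization pressure grows.

```python
import numpy as np

# Toy Goodhart sketch (illustrative only; objective, proxy, and numbers are made up).
# The "true" objective cares about two features and penalizes pushing feature 0 to
# extremes; the proxy only measures feature 0, noisily. A greedy hill-climber on the
# proxy stands in for "optimization pressure".

rng = np.random.default_rng(0)

def true_utility(x):
    # Rises with both features at first, but collapses if feature 0 is pushed too far.
    return x[0] - 0.01 * x[0] ** 2 + x[1]

def proxy(x):
    # Observable stand-in for the true objective: feature 0 plus measurement noise.
    return x[0] + rng.normal(scale=0.1)

def optimize_proxy(steps):
    # Greedy hill-climbing on the proxy; more steps = more optimization pressure.
    x = np.zeros(2)
    for _ in range(steps):
        candidate = x + rng.normal(scale=0.5, size=2)
        if proxy(candidate) > proxy(x):
            x = candidate
    return x

for steps in (10, 100, 1000, 10000):
    x = optimize_proxy(steps)
    print(f"steps={steps:5d}  proxy score={proxy(x):9.1f}  true utility={true_utility(x):9.1f}")
```

In this setup, a handful of steps improves the proxy and the true objective together; with many steps the proxy score keeps rising while the true utility goes sharply negative, which is the "crappy proxies won't cut it under a lot of optimization pressure" failure in miniature.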
Could you give an example of a desirable safety property where you are unsure what angle the adversarial pressure would come from?
Corrigibility is the first one that comes to mind.
What’s a proxy of corrigibility that you think might at first glance seem approximately-fine?
Obedience/deference seem like the obvious proxies of corrigibility.
A proxy is supposed to be observable so that it can actually be used for its intended purpose.
What use do you have for a measure of corrigibility, and how do you intend to observe obedience/deference for that use?