The John Wentworth argument that you are responding to is:
Goodhart’s Law means that proxies which might at first glance seem approximately-fine will break down when lots of optimization pressure is applied. And when we’re talking about aligning powerful future AI, we’re talking about a lot of optimization pressure. That’s the key idea which generalizes to other alignment strategies: crappy proxies won’t cut it when we start to apply a lot of optimization pressure.
What’s a proxy of corrigibility that you think might at first glance seem approximately-fine?
Obedience and deference seem like the obvious proxies for corrigibility.
A proxy is supposed to be observable so that it can actually serve its intended purpose.
What use do you have for a measure of corrigibility, and how do you intend to observe obedience/deference for that purpose?