I’m not sure whether corrigibility is a natural abstraction. It’s at least plausible, and if it is, then corrigibility by default should work under basically-similar assumptions.
Under your theory, is this not an issue because of the sparsity of natural abstractions near human values?
Basically, yes. We want the system to use its actual model of human values as a proxy for its objective, which is itself a proxy for human values. So the whole strategy falls apart in situations where the system converges to the true optimum of its objective. But in situations where a proxy for the system’s true optimum gets used (e.g. weak optimization, or insufficient data to distinguish the proxy from the true objective), the model of human values may be the best available proxy.
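As a rough toy sketch of that weak-vs-strong optimization point (the variables V, U, and M below are illustrative assumptions I'm introducing, not anything from the original setup): on typical candidates the model of human values correlates well with the objective, so it can serve as a proxy for it, but converging to the objective's true optimum leaves that proxy behind.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy quantities (all illustrative assumptions):
#   V: "true human values" scored over candidate actions
#   U: the system's training objective, a noisy proxy for V
#   M: the system's internal model of human values, a closer proxy for V
n = 100_000
V = rng.normal(size=n)
U = V + rng.normal(size=n)          # objective = noisier proxy for V
M = V + 0.1 * rng.normal(size=n)    # learned model of human values

# On typical candidates, M tracks U well, so a system that can't resolve
# U exactly might reasonably lean on M as a proxy for it.
print("corr(M, U) over typical candidates:", round(np.corrcoef(M, U)[0, 1], 2))

# Weak optimization: pick the best of a handful of candidates by U.
weak = rng.choice(n, size=10, replace=False)
v_weak = V[weak[np.argmax(U[weak])]]

# Strong optimization: converge to U's true optimum.
v_strong_U = V[np.argmax(U)]

# For contrast: strong optimization guided by the model of human values.
v_strong_M = V[np.argmax(M)]

print("V under weak optimization of U:  ", round(v_weak, 2))
print("V at U's true optimum:           ", round(v_strong_U, 2))
print("V at M's optimum (for contrast): ", round(v_strong_M, 2))
```

In this toy version, the optimum of U scores noticeably worse on V than the optimum of M does, which is the sense in which the strategy breaks exactly when the system can fully optimize its objective.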