Based on what you’ve said in the comments, I’m guessing you’d say the various forms of corrigibility are natural abstractions. Would you say we can use the strategy you outline here to get “corrigibility by default”?
Regarding iterations, the common objection is that we’re introducing optimisation pressure. So we should expect the usual alignment issues anyway. Under your theory, is this not an issue because of the sparsity of natural abstractions near human values?
I’m not sure whether corrigibility is a natural abstraction. It’s at least plausible, and if it is, then corrigibility by default should work under basically similar assumptions.
Under your theory, is this not an issue because of the sparsity of natural abstractions near human values?
Basically, yes. We want the system to use its actual model of human values as a proxy for its objective, which is itself a proxy for human values. So the whole strategy will fall apart in situations where the system converges to the true optimum of its objective. But in situations where the system ends up using a proxy for its objective’s true optimum (e.g. weak optimization, or insufficient data to distinguish the proxy from the true objective), its model of human values may be the best available proxy.
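Not from the original exchange, but here is a minimal toy sketch of that last point, assuming a one-dimensional action space, a hand-built stand-in for human values, and a proxy objective that agrees with it everywhere except a narrow exploitable spike. Weak optimization rarely finds the spike, so optimizing the proxy and optimizing the values model mostly coincide; strong optimization reliably converges to the proxy’s true optimum, where the two come apart.

```python
import random

# Toy illustration (hypothetical, not the author's method): a proxy objective
# that tracks a stand-in for "true human values" almost everywhere, plus one
# narrow exploitable spike that is the proxy's actual optimum.

def true_values(x):
    # Hypothetical stand-in for actual human values over actions x in [0, 1];
    # peaks at x = 0.5.
    return 1.0 - 4.0 * (x - 0.5) ** 2

def proxy_objective(x):
    # Agrees with true_values except on a narrow spike near x = 1.
    return true_values(x) + (5.0 if x > 0.995 else 0.0)

def optimize(n_samples):
    # Random search over actions: more samples = stronger optimization pressure.
    return max((random.random() for _ in range(n_samples)), key=proxy_objective)

def exploit_rate(n_samples, trials=1000):
    # Fraction of runs in which the optimizer lands on the spike, i.e. finds
    # the proxy's true optimum rather than something the values model endorses.
    return sum(optimize(n_samples) > 0.995 for _ in range(trials)) / trials

random.seed(0)
print("weak   (10 samples):  finds spike in", exploit_rate(10), "of runs")
print("strong (10k samples): finds spike in", exploit_rate(10_000), "of runs")
```

In this toy setup the weak optimizer almost always returns an action that scores well by both measures, while the strong optimizer almost always returns the spike, where the proxy is maximal but the values stand-in is low.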