I’m not sure what you mean by ‘philosophically’ simple?
I think if we had the right conception of goals, the difference between ‘corrigibility’ and ‘incorrigibility’ would be a short sentence in that language. (For example, if you have a causal graph that goes from “the state of the world” to “my observations”, you specify what you want in terms of the link between the state of the world and your observations, instead of in terms of the observations themselves.)
This is in contrast to, like, ‘practically simple’, where you’ve programmed in rules to not do any of the ten thousand things it could do to corrupt things.
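To make the causal-graph example concrete, here is a toy sketch. Everything in it is made up for illustration (the `World` fields, the three actions, the costs, and both goal functions), and it is only one way of cashing out "a goal about the link", not a proposal for how to actually specify corrigibility. The point is just that the two goals differ by one clause, while the resulting behavior differs a lot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class World:
    room_is_clean: bool     # the actual state of the world
    sensor_is_honest: bool  # whether the state -> observation link is intact


def observe(world: World) -> bool:
    """What the agent sees: a tampered sensor always reports 'clean'."""
    return world.room_is_clean if world.sensor_is_honest else True


# Hypothetical actions and costs, chosen so that tampering is the cheap option.
ACTIONS = {
    "do_nothing": lambda w: w,
    "clean_room": lambda w: World(True, w.sensor_is_honest),
    "tamper_with_sensor": lambda w: World(w.room_is_clean, False),
}
COST = {"do_nothing": 0.0, "clean_room": 0.5, "tamper_with_sensor": 0.1}


def goal_on_observations(world: World) -> float:
    """Goal stated purely in terms of what is observed."""
    return 1.0 if observe(world) else 0.0


def goal_on_link(world: World) -> float:
    """Same goal, plus one clause: the observation must still track the state."""
    return 1.0 if observe(world) and world.sensor_is_honest else 0.0


def best_action(goal, start: World) -> str:
    """Pick the action with the highest goal value minus cost."""
    return max(ACTIONS, key=lambda a: goal(ACTIONS[a](start)) - COST[a])


start = World(room_is_clean=False, sensor_is_honest=True)
print(best_action(goal_on_observations, start))
print(best_action(goal_on_link, start))
```

With these made-up numbers, the observation-based goal prefers tampering with the sensor and the link-based goal prefers actually cleaning the room. The ‘practically simple’ alternative would be to keep `goal_on_observations` and add a rule forbidding `tamper_with_sensor`, and then another rule for every other way of corrupting the sensor.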