What would a corrigible but not-intent-aligned AI system look like?
Suppose that I think you know me well, and I want you to act autonomously on my behalf using your best guesses about what I want. Then you can be intent aligned without being corrigible. Indeed, I may even prefer that you be incorrigible, e.g. if I want your behavior to be predictable to others. If the agent knows that I have such a preference, then it can't be both corrigible and intent aligned: doing what I intend would mean acting autonomously rather than deferring to my correction.