It seems like an underlying assumption of this post is that any useful safety property like “corrigibility” must be about outcomes of an AI acting in the world, whereas my understanding of (Paul’s version of) corrigibility is that it is also about the motivations underlying the AI’s actions. It’s certainly true that we don’t have a good definition of what an AI’s “motivation” is, and we don’t have a good way of testing whether the AI has “bad motivations”, but this seems like a tractable problem? In addition, maybe we can make claims of the form “this training procedure motivates the AI to help us and not manipulate us”.
I think of corrigibility as “wanting to help humans” (see here) plus some requirements on the capability of the AI (for example, it “knows” that a good way to help humans is to help them understand its true reasoning, and it “knows” that it could be wrong about what humans value). In the “teach me about charities” example, I think basically any of the behaviors you describe are corrigible, as long as the AI has no ulterior motive behind them. For example, trying to convince the billionaire to focus on administrative costs because that would make it easier for the AI to evaluate which charities are good would be incorrigible, whereas steering the billionaire toward administrative costs because the AI has noticed that he is very frugal would be corrigible. (Though ideally the AI would mention all of the arguments it expects would convince the billionaire, and then ask him which method of convincing him he would endorse.) I agree that testing corrigibility in such a scenario is hard (though I like Paul’s comment above as an idea for how to do it), but it seems like we can train an agent in such a way that the optimization will knowably (i.e. with high but not proof-level confidence) create an AI that is corrigible.