Regarding Corrigibility as an alternative safety measure:
I think that exploring the Corrigibility concept sounds like a valuable thing to do. I also think that Corrigibility formalisms can be quite tricky (for similar reasons that Membrane formalisms can be tricky: both are vulnerable to difficult-to-notice definitional issues). Consider a powerful and clever tool-AI. It is built using a Corrigibility formalism that works very well when the tool-AI is used to shut down competing AI projects. This formalism relies on a definition of Explanation that is designed to prevent any form of undue influence. When talking with this tool-AI about shutting down competing AI projects, the definition of Explanation holds up fine. Even in this scenario, it could be the case that asking this seemingly corrigible tool-AI about a Sovereign AI proposal is essentially equivalent to implementing that proposal.
Any definition of Explanation will necessarily be built on top of a lot of assumptions. Many of these will be unexamined implicit assumptions that the designers are not aware of. In general, it would not be particularly surprising if one of these assumptions turns out to hold when discussing things along the lines of shutting down competing AI projects, but turns out to break when discussing a Sovereign AI proposal.
Let’s take one specific example. Consider the case where the tool-AI will try to Explain any topic that it is asked about, until the person asking Understands the topic sufficiently. When asked about a Sovereign AI proposal, the tool-AI will ensure that two separate aspects of the proposal are Understood: (i) an alignment target, and (ii) a normative moral theory according to which this alignment target is the thing that a Sovereign AI project should aim at. It turns out that Explaining a normative moral theory until the person asking Understands it is functionally equivalent to convincing the person to adopt that normative moral theory. If the tool-AI is very good at convincing, then the tool-AI could be essentially equivalent to an AI that will implement whatever Sovereign AI proposal it is first asked to explain (with a few extra steps).
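The failure mode above can be made concrete with a toy sketch. This is purely my own illustration, not any actual proposal: every name here (`explain_until_understood`, the credence-threshold operationalization of Understanding, the nudging explainer) is hypothetical. The point is structural: if "Understands" is operationalized in a way that correlates with endorsement, then the explain-until-understood loop is a persuasion loop.

```python
# Toy sketch (hypothetical, not from any real formalism): the
# "Explain until Understood" loop, and one way its Understanding
# predicate can silently turn it into a persuasion loop.

def explain_until_understood(person, topic, explain, understands):
    """Keep explaining `topic` until `person` satisfies the Understanding predicate."""
    while not understands(person, topic):
        person = explain(person, topic)  # each round updates the person's beliefs
    return person

# Hypothetical operationalization: "Understanding" a normative theory is
# (perhaps inadvertently) measured as endorsing it above some threshold.
def understands(person, topic):
    return person["credence"].get(topic, 0.0) >= 0.9

# A very persuasive explainer: each explanation round nudges credence upward.
def explain(person, topic):
    updated = {"credence": dict(person["credence"])}
    updated["credence"][topic] = min(1.0, updated["credence"].get(topic, 0.0) + 0.2)
    return updated

skeptic = {"credence": {"theory_T": 0.1}}
convinced = explain_until_understood(skeptic, "theory_T", explain, understands)
# The loop can only terminate once the person effectively endorses theory_T:
# termination of "Explanation" and success of persuasion are the same event.
```

Note that nothing in the loop itself is malicious; the problem lives entirely in the definitional choice of the `understands` predicate, which is exactly the kind of difficult-to-notice assumption discussed above.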
Yes, in my discussions with Max Harms about CAST we discussed the concern of a highly capable corrigible tool-AI accidentally or intentionally manipulating its operators or other humans with very compelling answers to questions. My impression is that Max is more confident than I am that his version of corrigibility manages to avoid manipulation scenarios. I think this is definitely one of the more fragile and slippery aspects of corrigibility. In my opinion, manipulation-prevention in the context of corrigibility deserves more examination to see whether better protections can be found, and very cautious treatment during any deployment of a powerful corrigible tool-AI.