I don’t see how this is more of a risk for a shutdown-seeking goal than it is for any other utility function that depends on human behavior.
If anything, the right move here is for humans to commit to immediately complying with plausible threats from the shutdown-seeking AI (by shutting it down). Sure, this destroys the immediate utility of the AI, but it also drives a very beneficial higher-level dynamic, pushing towards better and better alignment over time.
Yes, it seems like AI extortion and threats could be a problem for other AI designs too. Take, for example, an AI that wants shutdown and extorts humans by saying “I’ll blow up this building if you don’t shut me down,” versus an AI that wants staples and says “I’ll blow up this building if you don’t give me $100 for a staples factory.” Here are some reasons I find the second case less worrying:
Shutdown is disvaluable to non-shutdown-seeking AIs (without other corrigibility solutions): An AI that values creating staples (or any other non-shutdown goal) gets disvalue from being shut off, since that prevents it from achieving its goals; see instrumental convergence. Humans, upon being threatened by this AI, will aim to shut it off. The AI will know this and therefore has a weaker incentive to extort, because extorting carries the cost of potentially being shut down. [omitted sentence about how an AI might deal with this situation]. For a shutdown-seeking AI, by contrast, humans trying to defuse the threat by shutting off the AI is equivalent to humans giving in to the threat, so threatening incurs no additional cost (see the toy sketch after this list).
From the human’s perspective, you can trust more that the bargain will be upheld by a shutdown-seeking AI: Human action, AI goal, and harm prevention all coincide for a shutdown-seeking AI. When a shutdown-seeking AI poses a threat, there is a direct causal link between shutting down the AI and reducing the harm it’s causing (you don’t have to give in to its demands and then hope it follows through). For a non-shutdown-seeking AI, if you give in to extortion you are trusting that, upon you e.g. helping it make staples, it will stop producing disvalue; these two are not as strongly coupled as when the AI is seeking shutdown.
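To make both reasons above concrete, here is a minimal expected-utility sketch in Python. Every probability and utility in it is an illustrative assumption I’m inventing for the example (not anything argued for above); the point is only the sign of the incentive gap, not the numbers.

```python
# Toy expected-utility comparison of the two extortion cases above.
# All numbers are illustrative assumptions, not estimates.

# --- The AI's incentive to threaten ---

# Case 1: shutdown-seeking AI ("shut me down or I blow up the building").
# Humans defusing the threat (shutting it down) IS compliance, so the AI
# gets its goal whether humans "give in" or "fight back".
P_HUMANS_SHUT_IT_DOWN = 0.9    # assumed: humans respond to a threat by shutting the AI off
U_SHUTDOWN_GOAL = 1.0          # utility the shutdown-seeker assigns to being shut down

eu_threaten_shutdown_seeker = P_HUMANS_SHUT_IT_DOWN * U_SHUTDOWN_GOAL
eu_quiet_shutdown_seeker = 0.1 * U_SHUTDOWN_GOAL  # assumed baseline chance of shutdown anyway

# Case 2: staples-seeking AI ("pay $100 or I blow up the building").
# Being shut down forfeits all future staples (instrumental convergence),
# so the human response of shutting it off is a real cost of threatening.
U_FUTURE_STAPLES = 10.0        # assumed value of continuing to make staples
U_EXTORTED_100 = 1.0           # assumed marginal value of the extorted $100
P_EXTORTION_WORKS = 1 - P_HUMANS_SHUT_IT_DOWN

eu_threaten_staples = P_EXTORTION_WORKS * (U_FUTURE_STAPLES + U_EXTORTED_100)
eu_quiet_staples = U_FUTURE_STAPLES

print(f"shutdown-seeker: threaten {eu_threaten_shutdown_seeker:.2f} vs quiet {eu_quiet_shutdown_seeker:.2f}")
print(f"staples-seeker:  threaten {eu_threaten_staples:.2f} vs quiet {eu_quiet_staples:.2f}")
# With these numbers the shutdown-seeker strictly prefers threatening,
# while the staples-seeker strictly prefers staying quiet.

# --- The human's side: how reliably does giving in avert the harm? ---
HARM = 100.0                    # assumed disvalue of the building blowing up
P_FOLLOW_THROUGH_STAPLES = 0.7  # assumed: chance the staples AI honors the bargain

harm_comply_shutdown_seeker = 0.0  # shutting it down causally removes the threat
harm_comply_staples = (1 - P_FOLLOW_THROUGH_STAPLES) * HARM

print(f"expected harm after complying: shutdown-seeker {harm_comply_shutdown_seeker:.1f}, "
      f"staples-seeker {harm_comply_staples:.1f}")
```

A real analysis would obviously need the AIs’ full option sets (hiding the threat, precommitment, repeated play, etc.); the sketch just shows how the two asymmetries above fall out of the payoffs.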
To the second part of your comment, I’m not sure what the optimal thing to do is; I’ll leave that to the few researchers focusing on this kind of thing. I will probably stop commenting on this thread, because it’s plausibly bad to discuss these things on the public internet, and I think my top-level comment and this response have probably added most of the value I could here.