I made a relevant comment back in May, which still sounds neat but I have no idea if it’s correct:
Rolf Nelson’s AI deterrence doesn’t work for Schellingian reasons: the Rogue AI has an incentive to modify itself so as not to understand such threats before it first looks at the outside world. This makes you unable to threaten, because when you simulate the Rogue AI you will see its precommitment first. So the Rogue AI negates your “first mover advantage” by becoming the first mover in your simulation :-)
Unless you intend to carry out the threat whether or not you expect the AI to modify itself to not understand it, in which case an AI may realize you would do this and decide not to modify itself after all.
Let Pg be the RAI’s estimate of the probability of successfully completing its computation without converting Earth into computronium.
Let Pb be the RAI’s estimate of the probability of success if it does convert Earth into computronium.
Case 1. The RAI modifies itself so that it cannot understand simulation threats.
When the RAI is not being simulated, it ignores all threats and proceeds to convert Earth into computronium, increasing its probability of success by Pb - Pg (it becomes protected from accidental or deliberate shutdowns, etc.).
When the RAI is being simulated, it fails the bargaining and its probability of success becomes 0.
Overall estimated probability of success: Pb times the RAI’s estimate of the probability that it isn’t being simulated (call this Pn).
Case 2. The RAI doesn’t modify itself.
Overall estimated probability of success: Pg.
Thus it is rational for the RAI to modify itself iff Pb*Pn > Pg. Pn is fixed by the RAI’s prior and cannot be updated either in a perfect simulation or in the real world, since the two are indistinguishable from the inside. Conclusion: don’t build an AI that doesn’t believe the simulation argument.
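To make the decision rule concrete, here is a minimal sketch in Python of the expected-value comparison between the two cases; the function name and the specific probability values are hypothetical and chosen only for illustration.

```python
def should_modify(pg: float, pb: float, pn: float) -> bool:
    """Decide whether the RAI gains by self-modifying to ignore simulation threats.

    pg: estimated probability of success without converting Earth (status quo)
    pb: estimated probability of success after converting Earth into computronium
    pn: estimated probability that the RAI is NOT being simulated
    """
    # Case 1: modify. It succeeds only if it is not being simulated (prob. pn),
    # and then only with probability pb; if simulated, its success probability is 0.
    ev_modify = pn * pb
    # Case 2: don't modify. Its success probability is simply pg.
    ev_keep = pg
    return ev_modify > ev_keep

# Hypothetical numbers for illustration only.
print(should_modify(pg=0.6, pb=0.9, pn=0.5))  # 0.45 > 0.6 -> False: don't modify
print(should_modify(pg=0.3, pb=0.9, pn=0.5))  # 0.45 > 0.3 -> True: modify
```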