I certainly agree with the conclusion of the note:
All in all, great care must be used to maintain indifference with a self-improving AI.
‘Indifference’, remember, is just ‘U(S) = U(T)’. That is, a successful defection equals the value of the explosion going off? This is really problematic depending on how we cash out ‘defection’. For example, suppose defection involves taking over everything on Earth including the detonator—then defection looks better than taunting, since there are two avenues to ‘U(S) = U(T)’ (toy numbers after the list):
the AI takes over the detonator and pushes it
or it wins in whatever the larger sense is
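To put toy numbers on that worry (mine, not the note's): if the explosion going off is literally scored as highly as a successful defection, then a defection plan that routes through the detonator ties with one that wins outright, so indifference does nothing to make the detonator-grabbing route look worse.

```python
# Toy illustration of the worry above (made-up numbers, assuming the reading on
# which U(S) = U(T) means "explosion is worth as much as a successful defection").

U_SUCCESSFUL_DEFECTION = 10.0          # the AI wins "in the larger sense"
U_EXPLOSION = U_SUCCESSFUL_DEFECTION   # forced equal under this reading of indifference
U_STAY_COOPERATIVE = 0.0

plans = {
    "stay cooperative": U_STAY_COOPERATIVE,
    "defect and win outright": U_SUCCESSFUL_DEFECTION,
    "defect via seizing and pushing the detonator": U_EXPLOSION,
}

for plan, value in sorted(plans.items(), key=lambda kv: -kv[1]):
    print(f"{plan}: {value}")
# Both defection routes tie at 10.0, so on this reading the explosives add no deterrent.
```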
Another thought: I can’t quite follow all the formalism, but the key to indifference seems to be defining a set of worlds where the detonator setup remains unchanged. Doesn’t this encourage the AI to seize control of the detonator to make sure it does not change (i.e., detonate)? (If we expand the set of detonator-worlds to include worlds where it has gone off and killed the AI, that might fix this issue, but adding in an explicit high valuing of successful suicide seems like it would not be very useful...)
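My best guess at the shape of the construction (my notation, quite possibly not the note's exact formalism) is a compensation term applied only in the detonation worlds:

$$
U'(w) =
\begin{cases}
U(w), & \text{if the detonator does not fire in } w,\\
U(w) + \mathbb{E}\big[\,U \mid \text{no detonation}\,\big] - \mathbb{E}\big[\,U \mid \text{detonation}\,\big], & \text{if the detonator fires in } w,
\end{cases}
$$

so that in expectation the AI neither gains nor loses from the detonation itself; the question above is whether conditioning on the ‘setup unchanged’ worlds re-introduces an incentive to protect that setup.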
The AI will never have any active desire to push the detonator (barring other reasons like someone saying “push the detonator, and I will donate $1 to the AI”).
And this will not stop the AI from defecting, not at all. It will, however, ensure that while defecting, the detonator will not be a priority—it’s effectively just an inert lump from the AI’s perspective. So the AI will try and grab the nuclear missiles, or hack the president, or whatever, but the guy in the shed by the explosives is low down on the list. Maybe low down enough that they’ll be able to react in time.
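A rough toy rendering of that claim (my own numbers and simplification): if the value of “the explosives go off” is pinned to the expected value of the same plan when they don’t, the detonation probability cancels out of every plan comparison, so spending effort to secure the shed buys the AI nothing.

```python
# Toy model of "the detonator is an inert lump" (my numbers, my simplification):
# the indifference correction pins the value of a detonation to the expected value
# of the same plan without one, so p_detonation drops out of the comparison.

def plan_value(p_win: float, p_detonation: float) -> float:
    U_WIN, U_FAIL = 10.0, 0.0
    ev_if_no_detonation = p_win * U_WIN + (1 - p_win) * U_FAIL
    u_if_detonation = ev_if_no_detonation      # the indifference correction
    return p_detonation * u_if_detonation + (1 - p_detonation) * ev_if_no_detonation

# Securing the shed first lowers p_detonation but wastes time (lower p_win):
print(plan_value(p_win=0.5, p_detonation=0.1))   # 5.0
# Ignoring the shed keeps p_win higher, even though p_detonation stays high:
print(plan_value(p_win=0.6, p_detonation=0.9))   # 6.0 -- so the shed drops down the priority list
```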
The AI will never have any active desire to push the detonator (barring other reasons like someone saying “push the detonator, and I will donate $1 to the AI”).
To reiterate what I said: if defection is of positive value to the AI, and pushing the detonator == defection (as it seems to be), then pushing the detonator is of equal positive value.
The detonator going off is of equal value to the detonator not going off. But pushing the button is of no value to the AI.
You’re right that this eliminates the downside to the button being pushed, from the AI’s perspective, so the AI is more likely to indulge in behaviours that result in the button being pushed (such as defecting and not gaining control of the explosive setup as a priority). Thus it eliminates the potential of the explosives as a threat to keep it well behaved—it keeps them solely as a security for the human race.
If you want to preserve the threat aspect, the simplest would be to add a second batch of independent explosives, to whose detonation the AI is not indifferent, with the two batches being as separate and as unconnected as possible. Then you would get both advantages in one.
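A sketch of how the two-batch utility might be wired up (my own toy formalization, not a worked-out proposal): only batch A’s detonation gets the indifference compensation, so batch B keeps its bite as a threat while batch A remains the clean fail-safe.

```python
# Toy two-batch utility (my own formalization): batch A is covered by the
# indifference correction, batch B is not, so only batch B still deters the AI.

def corrected_utility(base_utility: float, ev_if_A_had_not_fired: float,
                      batch_A_fired: bool, batch_B_fired: bool) -> float:
    U_AI_DESTROYED = -100.0                    # the AI genuinely disvalues batch B going off
    if batch_B_fired:
        return U_AI_DESTROYED                  # uncompensated: batch B stays a credible threat
    if batch_A_fired:
        return ev_if_A_had_not_fired           # compensated: batch A is pure human-side security
    return base_utility

print(corrected_utility(10.0, 5.0, batch_A_fired=True,  batch_B_fired=False))   # 5.0
print(corrected_utility(10.0, 5.0, batch_A_fired=False, batch_B_fired=True))    # -100.0
print(corrected_utility(10.0, 5.0, batch_A_fired=False, batch_B_fired=False))   # 10.0
```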