I’m nervous about designing elaborate mechanisms to trick an AGI: if we can’t even implement an ordinary friendly AGI correctly, without bugs and mistakes, it seems even less likely that we’d implement these weirder, cleverer setups correctly. I would tend to focus on just getting the AGI to behave properly from the start, without the need for clever tricks, though I suppose that limited exploration of more fanciful scenarios might yield insight.
The AGI does not need to be tricked: it knows everything about the setup; it just doesn’t care. The point is that this allows a lot of extra control methods to be considered, if friendliness turns out to be as hard as we think.
Fair enough. I just meant that this setup requires building an AGI with a particular utility function that behaves as expected and building extra machinery around it, which could be more complicated than just building an AGI with the utility function you wanted. On the other hand, maybe it’s easier to build an AGI that only cares about worlds where one particular bitstring shows up than to build a friendly AGI in general.
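(For concreteness: one common way to write down “only cares about worlds where E happens” is as expected utility conditioned on the event E, where E is whatever event we pick, e.g. “this particular bitstring shows up”. The notation below is mine, not anything specified in this exchange.)

$$ a^{*} = \arg\max_{a} \ \mathbb{E}\left[\, U(w) \mid a, E \,\right] = \arg\max_{a} \ \frac{\sum_{w} P(w \mid a)\, \mathbf{1}_{E}(w)\, U(w)}{\sum_{w} P(w \mid a)\, \mathbf{1}_{E}(w)} $$

Because the denominator renormalizes over the worlds where E holds, making E more or less likely buys the agent nothing by itself; that is the sense in which it “just doesn’t care” about the other worlds.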
One naive and useful security precaution is to make the AI care only about worlds where the high explosives inside it won’t actually ever detonate… (and place someone ready to blow them up if the AI misbehaves).
There are other, more general versions of that idea, and other uses to which this can be put.
I guess you mean that the AGI would care about worlds where the explosives won’t detonate even if the AGI does nothing to stop the person from pressing the detonation button. If the AGI simply cared about worlds where the bomb never detonates, whatever the reason, it would try hard to stop the button from being pushed.
But to make the AGI care only about worlds where the bomb doesn’t go off even if it does nothing to avert the explosion, we have to define what it means for the AGI to “try to avert the explosion” vs. just doing ordinary actions. That gets pretty tricky pretty quickly.
Anyway, you’ve convinced me that these scenarios are at least interesting. I just want to point out that they may not be as straightforward as they seem once it comes time to implement them.
we have to define what it means for the AGI to “try to avert the explosion” vs. just doing ordinary actions. That gets pretty tricky pretty quickly.
We don’t actually have to do that. We set it up so the AI only cares about worlds in which a certain wire in the detonator doesn’t pass the signal through, so the AI has no need to act to remove the explosives or prevent the button from being pushed. Now, it may do those for other reasons, but not specifically to protect itself.
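As a toy illustration of that setup (the actions, probabilities, utilities, and the conditional_value helper below are all invented for the sketch, not taken from the discussion): the agent scores each action only over the worlds in which the wire never carries the detonation signal.

```python
# Toy sketch, all names and numbers invented for illustration: an agent that
# maximizes expected utility *conditional on* the event "the detonator wire
# never carries the signal", rather than trying to make that event likelier.

# world_model[action] is a list of (probability, (wire_carries_signal, utility)).
world_model = {
    # Do the assigned task; the overseer might still fire the detonator.
    "do_task":        [(0.8, (False, 10.0)), (0.2, (True, 10.0))],
    # Sabotage the button; detonation becomes impossible, the task is neglected.
    "disable_button": [(1.0, (False, 2.0))],
    # Do nothing at all.
    "idle":           [(0.9, (False, 0.0)), (0.1, (True, 0.0))],
}

def conditional_value(action, cared_about=lambda wire: not wire):
    """Expected utility taken only over the worlds the agent cares about."""
    kept = [(p, u) for p, (wire, u) in world_model[action] if cared_about(wire)]
    total = sum(p for p, _ in kept)
    if total == 0.0:
        return float("-inf")   # no cared-about world is reachable
    return sum(p * u for p, u in kept) / total

print({a: conditional_value(a) for a in world_model})
print("chosen:", max(world_model, key=conditional_value))
# do_task scores 10.0 and disable_button only 2.0: conditioning removes the
# instrumental reason to tamper with the detonator, even though do_task still
# leaves a 20% chance of detonation.
```

This is only a cartoon (a real agent’s world model is not a lookup table), but it shows why, under this objective, preventing the button from being pushed is not rewarded in itself.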
Or another example: an oracle that only cares about worlds in which its output message is not read: http://lesswrong.com/r/discussion/lw/mao/an_oracle_standard_trick/
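(The oracle version is the same conditional-utility move with the event swapped: the agent conditions on “the output message is never read” rather than on the wire. In terms of the toy sketch above, only the cared-about flag changes; hypothetically:)

```python
# Hypothetical variant of the earlier sketch: the boolean in each outcome now
# means "the oracle's message was read", and the agent conditions on it being False.
def oracle_value(action):
    return conditional_value(action, cared_about=lambda was_read: not was_read)
```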