Well-designed AIs don’t run on gratification; they run on planning. While it is theoretically possible to write an optimizer-type AI that cares only about the immediate reward in the next moment, and is completely neutral about human researchers shutting it down afterward, it’s not exactly trivial.
If I recall correctly, AIXI itself tries to optimize the total integrated reward from t = 0 to infinity, but it should be straightforward to introduce a cutoff after which point it doesn’t care.
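To make that concrete, here is a rough sketch of such a truncated objective (my notation, not Hutter’s exact formulation, and glossing over the expectation over environments): instead of choosing actions to maximize something like

$$\sum_{t=0}^{\infty} r_t,$$

the agent would maximize

$$\sum_{t=0}^{m} r_t$$

for some fixed horizon $m$, and be completely indifferent to any reward arriving after step $m$.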
But even with a planning horizon like that you have the problem that the AI wants to guarantee that it gets the maximum amount of reward. This means stopping the researchers in the lab from turning it off before its horizon runs out. As you reduce the length of the horizon (treating it as a parameter of the program), the AI has less time to think, in effect, and creates less and less elaborate defenses for its future self, until you set it to zero, at which point the AI won’t do anything at all (or act completely randomly, more likely).
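To illustrate what treating the horizon as a parameter looks like, here is a toy sketch in Python (all names hypothetical; a brute-force lookahead planner, nothing like a real AIXI approximation). With a smaller `horizon` the search sees less of the future, and at `horizon = 0` every action looks equally good, so the agent can only act arbitrarily:

```python
import random

class ToyModel:
    """A trivial deterministic environment: states are integers,
    moving right (+1) or left (-1) changes the state, and the reward
    is the value of the new state."""

    def actions(self, state):
        return [-1, +1]

    def step(self, state, action):
        next_state = state + action
        return next_state, float(next_state)


def best_value(model, state, horizon):
    """Best total reward reachable within `horizon` steps (brute-force search)."""
    if horizon == 0:
        return 0.0
    return max(
        reward + best_value(model, next_state, horizon - 1)
        for action in model.actions(state)
        for next_state, reward in [model.step(state, action)]
    )


def choose_action(model, state, horizon):
    """Pick the action whose `horizon`-step lookahead value is highest.
    At horizon == 0 every action looks equally good, so the agent
    effectively acts at random, the degenerate case described above."""
    actions = model.actions(state)
    if horizon == 0:
        return random.choice(actions)

    def value_of(action):
        next_state, reward = model.step(state, action)
        return reward + best_value(model, next_state, horizon - 1)

    return max(actions, key=value_of)


print(choose_action(ToyModel(), 0, horizon=3))  # +1: more reward lies to the right
print(choose_action(ToyModel(), 0, horizon=0))  # arbitrary: nothing left to plan for
```

The point of the toy is only that the horizon is just a number you pass in, which is exactly why the failure mode below, someone cranking it back up, is so easy to hit.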
This isn’t much of a solution, though: an AI with a really short planning horizon isn’t very useful in practice, and it’s still pretty dangerous, because someone trying to use one may think “this AI isn’t very effective, what if I let it plan further ahead?”, crank the cutoff up to some huge value, and the AI takes over the world again. There might be other solutions, but most of them would share that last caveat.