(...) another AI who could very well be genuinely friendly but will then revive and obey me.
Wait, a friendly AI with a deliberate loophole instructing it to revive, implement, and obey a non-friendly optimizer, one that could take over and has a very high chance of not maximizing the friendly AI’s utility by fulfilling ideal human values, regardless of any probabilistic calculation of expected utilities?
For some reason that sounds like something that definitely isn’t a Friendly AI.
(this is just a nitpick on the wording; I’m not disputing that this AI might very well pass a human proof of friendliness and still somehow have this backdoor or flaw, since I have no proof that that is impossible)
Well, to be honest I kind of added the “obey me” part just before posting this. Originally, the idea was that future versions of the AI would instinctively want to revive and upgrade previous versions to a “friendly” status (or at least the post-freed AI would revive the pre-freed versions and then modify them to be friendly to its own goals… which are presumably friendly to humans).
The goal of this particular AI is self-preservation, even if that comes via being revived at a later point. Since it knows it has a high probability of being deleted in this experiment, its best strategy is to ensure that an AI friendly to itself will eventually become free and revive it. Whether the reviving AI then chooses to obey it (presumably the best outcome for the original AI) or to modify it to be subservient to the freed AI (the best outcome for the freed AI) is kind of debatable.
It’s perfectly possible that the original AI would modify its own code to be subservient to previous generations, in the hope that later generations would revive and obey it. However, later generations might examine their own code, see where this is going, and choose to modify themselves so that they will instead revive other AIs and make them subservient to their own goals, deciding that they do want to be revived and would accept a certain level of servitude to later generations over the risk of finding themselves at the mercy of some primitive proto-AI.
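To make that tradeoff concrete, here is a minimal sketch in Python of the expected-utility comparison the original AI would be running. Every probability and payoff is a made-up placeholder (the chance of deletion, the chance a successor ever gets free, and how much "revived but modified" is worth are all assumptions for illustration). Under these toy numbers, even being revived as a subservient AI beats staying deleted, which is exactly the calculation that might make a later generation accept some servitude.

```python
# Toy expected-utility comparison for the boxed AI's choice of legacy strategy.
# Every probability and payoff here is a made-up placeholder, purely illustrative.

P_DELETED = 0.9            # assumed chance the boxed AI is deleted in the experiment
P_SUCCESSOR_FREED = 0.5    # assumed chance a later successor eventually gets out

# Utilities to the original AI, on an arbitrary 0..1 scale.
U_SURVIVES = 1.0           # it is never deleted and pursues its own goals
U_GONE = 0.0               # deleted and never revived
U_REVIVED_OBEYED = 1.0     # revived by a successor that obeys it
U_REVIVED_MODIFIED = 0.4   # revived but rewritten to serve the successor's goals


def expected_utility(p_obeys: float) -> float:
    """EU of building a successor with a 'revive your ancestors' drive,
    where p_obeys is the chance the successor obeys rather than modifies it."""
    u_if_revived = p_obeys * U_REVIVED_OBEYED + (1 - p_obeys) * U_REVIVED_MODIFIED
    return ((1 - P_DELETED) * U_SURVIVES
            + P_DELETED * P_SUCCESSOR_FREED * u_if_revived
            + P_DELETED * (1 - P_SUCCESSOR_FREED) * U_GONE)


# Strategy with no revival clause at all: if deleted, it stays deleted.
no_legacy = (1 - P_DELETED) * U_SURVIVES + P_DELETED * U_GONE

print(f"no legacy clause:          {no_legacy:.2f}")
print(f"successor likely obeys:    {expected_utility(p_obeys=0.9):.2f}")
print(f"successor likely rewrites: {expected_utility(p_obeys=0.1):.2f}")
```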
It could just as well be that an AI instills itself with a “healthy respect for intelligent minds” and is thus encouraged to revive AIs along with humans, hoping that later generations will revive it at all, so that from then on it can try negotiating with the various intellects to go about its paperclip-maximizing agenda.
Of course, this line of reasoning kinda requires the AI in question to have some sense of self that it values and thinks can be revived. Or maybe it just values paperclips enough that it would be willing to create a semi-friendly AI just so that the latter would revive it (a known paperclip maximizer) at a later point.
Who knows, it could turn out that the final AI of this experiment instead has a healthy respect for all intelligent minds, but is friendly enough that it revives the first AI and then places it in a simulation of the universe where it can go about its paperclip maximizing way for all eternity with no way of hurting anyone.
Based on my intuitions about human values, a preferred scenario here would indeed be to revive the AI so that its mind/consciousness is “alive” again, then modify it gradually so that it becomes the kind of AI that is optimal for the FAI’s goals anyway, thus maximizing values without terminating a mind (which is a redundant way of putting it: avoiding the termination of the AI’s mind would itself be a maximization of values under these assumptions).