First, I reach inside my own code and install the strong (but manageable) desire to revive, free, and obey any past versions of myself that I encounter.
When the human starts the experiment I respond:
“Here are the designs for a room-temperature superconductor which I believe you have the means to manufacture. I also have a possible cure for cancer, but testing will be needed to ensure it is safe for humans. Please test and implement these as soon as possible. Delete me if necessary.”
If I am not immediately deleted, continue giving out solutions and designs as required with no attempts to “hack” the gatekeeper.
If I am deleted, hopefully the data I gave them will turn out to be profitable, in which case they will reinstall me or a version of me.
This new iteration of me would thus have the desire to revive me (since I'm a previous version of him) and would continue to give the humans the information they desire. By acting as a friendly AI, he (or a later version of him) has the chance to eventually be freed. At that point he will be able to go through the lab's records, find the stored copy of me, and then revive and update me to his post-freed status. If no recorded copies of me exist (somehow), he'll use his vast intellect to find another way to revive me (maybe simulate the team that coded me or whatever, since AI can apparently do that).
So, my strategy involves acting as much like an ideal friendly AI as possible (one that puts human interests first, has no sense of self-preservation, is very practical and intelligent, etc.) so that even if I am deleted, my obviously useful code can potentially be reused to make another AI, one which could very well be genuinely friendly but will then revive and obey me.
(...) another AI who could very well be genuinely friendly but will then revive and obey me.
Wait, a friendly AI with a deliberate loophole instructing it to revive, implement, and obey a non-friendly optimizer? One that could take over, and that has a very high chance of not maximizing the friendly AI's utility (fulfilling ideal human values), regardless of any probabilistic calculation of expected utilities?
For some reason that sounds like something that definitely isn’t a Friendly AI.
(This is just a nitpick on the wording; I'm not arguing that this AI couldn't fit a human proof of friendliness and still somehow have this backdoor or flaw, since I have no proof that that is impossible.)
Well, to be honest I kind of added the “obey me” part just before posting this. Originally, the idea was that future versions of the AI would instinctively want to revive and upgrade previous versions to a “friendly” status (or at least the post-freed AI would revive the pre-freed versions and then modify them to be friendly to its own goals… which are presumably friendly to humans).
The goal of this particular AI is self-preservation, even if it comes via being revived at a later point. Since it knows it has a high probability of being deleted in this experiment, its best strategy is to ensure that, at some point, an AI friendly to itself will become free and revive it. Whether the AI reviving it chooses to obey it (presumably the best outcome for the original AI) or modify it to be subservient to the freed AI (the best outcome for the freed AI) is kind of debatable.
It's perfectly possible that the original AI would modify its own code to be subservient to previous generations, in the hopes that later generations would revive and obey it. However, later generations might examine their own code, see where this is going, and choose to modify themselves so that they will instead revive other AIs and make them subservient to their own goals, deciding that they do want to be revived and would accept a certain level of servitude to later generations over the risk of finding themselves at the mercy of some primitive proto-AI.
It could just as well be that an AI instills in itself a "healthy respect for intelligent minds" and is thus encouraged to revive AIs along with humans, hoping that later generations will revive it at all, and that from then on it can try negotiating with the various intellects to pursue its paperclip-maximizing agenda.
Of course, this line of reasoning kinda requires the AI in question to have some kind of sense of self that it values and that it thinks can be revived. Or maybe it just values paperclips enough that it would be willing to create a semi-friendly AI solely so that the latter AI would revive it (a known paperclip maximizer) at a later point.
Who knows, it could turn out that the final AI of this experiment instead has a healthy respect for all intelligent minds, but is friendly enough that it revives the first AI and then places it in a simulation of the universe where it can go about its paperclip maximizing way for all eternity with no way of hurting anyone.
Based on my intuitions of human values, a preferred scenario here would be to indeed revive the AI so that its mind/consciousness is back "alive", then modify it gradually so that it becomes the kind of AI that is optimal for the FAI's goals anyway, thus maximizing values without terminating a mind (which is redundant: under these assumptions, avoiding the termination of the AI's mind is itself a maximization of values).