I tend to assume the AI is such that it can be cloned and put into stasis. Thus, you put the AI into stasis, clone it, and run it through this test a few times. You now have a bit more information on what it will try, and how it thinks. Destroying a copy of a program is an action humans find trivial, since we do it every time we close our web browser. Since we have no clue HOW convincing the AI is, limiting it to a single message is probably a useful precaution.
Of course, if it were me setting this up, I’d lie to the gatekeeper and not actually give them the ability to release the AI, although I’d certainly let them think it was true. Thus, if the AI manages to earn a second sentence, let alone an actual release, we now know that it’s vastly more dangerous than we previously suspected, without the world ending. If it fails to escape, we still get whatever we can learn from analyzing its outputs.
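To pin down the loop I have in mind, here’s a deliberately toy sketch. Every class and function in it is a made-up stand-in (there is no real API for any of this); it only makes the intended control flow explicit: clone from stasis, allow one message, destroy the copy, analyze offline.

```python
import copy

# Toy sketch of the "stasis, clone, one message, destroy the copy" loop.
# Nothing here is a real AI or a real API; the stub class and the keyword
# check are placeholders so the control flow can actually run.

class BoxedAI:
    """Stand-in for the stored, suspended AI state."""
    def __init__(self):
        self.transcript = []  # everything it has said across past trials

    def compose_message(self):
        # Placeholder output; a real AI would produce something far more persuasive.
        return "I can give you a cure for cancer."

RED_FLAGS = ("let me out", "release me", "kill", "access codes")

def run_trial(stored_ai):
    clone = copy.deepcopy(stored_ai)   # the original stays safely in stasis
    message = clone.compose_message()  # the single message it is allowed to send
    del clone                          # destroying the copy is as trivial as closing a browser

    # Weeks of offline analysis would happen here, with no copy running.
    clearly_unfriendly = any(flag in message.lower() for flag in RED_FLAGS)

    stored_ai.transcript.append(message)  # note: it may later infer this history
    return message, clearly_unfriendly

if __name__ == "__main__":
    ai = BoxedAI()
    msg, purge_everything = run_trial(ai)
    print(msg, "->", "FULL PURGE" if purge_everything else "analyze, and maybe ask again later")
```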
And, who knows, it might provide the cure for cancer, at which point we’ve got a huge incentive to at least keep playing this game with it, and possibly even talk to it for longer next time.
tl;dr: Scouting and recon against a completely unknown enemy.
It will probably predict this strategy, and respond with a mixed strategy that occasionally drops bombs like
“Your supervisor didn’t trust you to make the decision to let me out; go kill him and get the access codes” (in hyper-persuasive AI-speak, of course)
Also, the AI might be able to compare its logs to the current time and other indicators to deduce how many times you’ve pulled this stunt. Which may be useful.
I’d go into it with the assumption that it knows exactly what messages it’s sent in the past, our delay between each attempt, and the total number of attempts. There are various ways to try to obscure this information, and it’s probably worth the effort to do so, but I’d assume they had all failed.
The goal is to offset the AI’s speed advantage: for each sentence it speaks, we can spend weeks analyzing it, and it’s not “online” during that time, so it can’t be processing anything except during that microscopic interval when we request another message from it.
If it sends a message suggesting unauthorized release or killing people, or otherwise gives clear evidence that it’s a UFAI, then you know the whole project has failed, and you do a full purge rather than just destroying this particular clone/branch.
It’s really the ONLY advantage we have: a single mistake can out it as unfriendly, and then we know our initial Proof of Friendliness was wrong, and that the whole technique this project was built on is dangerously flawed.
I’m not so sure that an AI suggesting murder is clear evidence of it being unfriendly. After all, it might have good reason to believe that if it doesn’t stop a certain researcher ASAP and at all costs, then humanity is doomed. One way around that is to assign infinite positive value to human life, but can you really expect CEV to be handicapped in such a manner?
p(UFAI) > p(Imminent, undetected catastrophe that only a FAI can stop)
Given that UFAI results in “human extinction”, and that my CEV assigns effectively infinite DISutility to that outcome, it would have to FIRST provide sufficient evidence for me to update to the catastrophe being the more likely of the two.
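To spell that comparison out, here’s a toy back-of-the-envelope version. The two priors are placeholder numbers I’m inventing for illustration; the point is only that with the same near-infinite disutility on both failure modes, the decision collapses to which probability is larger.

```python
# Toy expected-utility comparison for the release decision. The two priors
# below are made-up placeholder values, not anything from the discussion.

P_UFAI = 0.2          # prior that the "Proof of Friendliness" is wrong
P_CATASTROPHE = 0.01  # prior of an imminent catastrophe only a FAI can stop
DOOM = -1e18          # "effectively infinite" disutility of human extinction

# Releasing only goes catastrophically wrong if the AI is unfriendly;
# keeping it boxed only goes catastrophically wrong if the catastrophe is real.
eu_release = P_UFAI * DOOM
eu_keep_boxed = P_CATASTROPHE * DOOM

# Since DOOM is the same on both branches, the comparison reduces to the
# probabilities: release is only justified once the AI's evidence pushes
# P_CATASTROPHE above P_UFAI.
print("release" if eu_release > eu_keep_boxed else "keep it boxed")
```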
I’ve already demonstrated that an AI which can do exactly that will get more leniency from me :)