That gatekeeper isn’t very security minded. They should have their IRC on a client that pings a sound when they hear your response, then paste “AI DESTROYED” and hit enter to send the message without actually reading your response, and without TELLING anyone that they are doing that, (Gatekeeper wins all rule disputes, after all.) and then once they hear the outgoing text, and they’ve destroyed you (and hence won.) THEN they can read the text you sent them, if they are curious.
Those rules seem like they force the gatekeeper to read at least 1 sentence, but they don’t actually do that.
Edit: Better method, set up a script that responds to any and all text with “AI DESTROYED” if you have to wait for the person to start typing, they may try to bore you into opening your eyes wondering why the experiment hasn’t started yet, and you might accidentally read something. Again, tell no one you have written such a script and are using it.
Better method, set up a script that responds to any and all text with “AI DESTROYED” if you have to wait for the person to start typing, they may try to bore you into opening your eyes wondering why the experiment hasn’t started yet, and you might accidentally read something.
All good security measures. The key feature seems to be that they are progressively better approximations of not having an unsafe AI with a gatekeeper and an IRC channel in the first place!
Well yes, if you stick the AI in a safe, cut all network cables, and throw away the key and combination, it probably wouldn’t be able to get out. But it wouldn’t be very useful either.
The entire point of these thought experiments is that a sufficiently useful and smart AI (i.e. the kind of AI that we want to make) will eventually find a way to at least be able to communicate with someone that has the authority to allow it to interact with the outside world. I think that if you really think about it, there are few scenarios where this is not possible. I certainly can’t think of any useful application of SAI that is also 100% effective at keeping it inside its box.
A good present-day analogy is computer security. Time and time again it has been proven that there is no simple silver bullet solution to the problem of balancing functionality and security—it requires expertise, constant maintenance, rigorous protocols, etc. And yet, hackers still manage to get through a lot of the time. It took a very long time for computer security to mature to the point where it is today where we can build reasonably (still not 100% of course), secure systems, and we’re just battling regular humans with a grudge—nothing even close to the threat a SAI would present.
Indeed. In fact, most of the solutions I’ve seen mentioned lately are all of one trend that edges closer and closer towards:
“Build a completely unsafe and suspicious AI, put it on a disconnected small computer with a bunch of nanites for self-modification and a large power reserve, with so many walls and physical barriers that it is impossible for the AI to get through with the amount of energy it could generate if it turned half of its materials into antimatter, and then put no input or output channels there of any kind, just have a completely useless multi-trillion-dollar marvel of science and engineering sitting in the practical equivalent of a black hole.”
All good security measures. The key feature seems to be that they are progressively better approximations of not having an unsafe AI with a gatekeeper and an IRC channel in the first place!
Entirely fair point.
I suppose this might come up because there are no stated benefits which you are trying to receive. No Gatekeeper seemingly gets to pump her fists and say “I got the AI to give us the cure for cancer and the winning lottery numbers and a free energy source and STILL didn’t let them out of the box. Beat that Michaelos, with your AI DESTROYED script!” (Or maybe I just miss the gatekeepers who have done this?)
Which makes sense, because my understanding is that was never the point of the original proposed experiment, which seems to be a variant of “Super Intelligent Spammers can be convincing. I’ve proved this by being convincing while less powerful.” (I use spammers as an example because both are manipulators who can only send you text messages and who rewarding requires your voluntary cooperation, while none of their offers are real, which is quite similar to normal email spam and also the original experiment as far as I can tell.)
So if the AI is defined that ALL attempts at communication from it are spam and attempts to manipulate you, so the correct approach is treating it like a blacklisted spam address. If every e-mail from virusspam.com is an attempt to hack you, a good spamfilter could just dump everything from virusspam.com into spam and never even bother to bring it to your inbox.
If the idea is we are attempting to distinguish real AI’s from fake AI’s, then a more isomorphic scenario would probably be a game of Mafia set up as follows.
You are a town person. It is your turn to lynch. There are two other players. One is Mafia(UFAI), one is another town person(FAI). If you lynch Mafia, you and the other town person both win. If you lynch the other town person, the Mafia kills you in your sleep, and you lose. If you do anything else (Vote to lynch noone, vote to lynch yourself) You lose.
You may communicate textually with either of the other two players for as long as you want before making a decision. Win.
That seems a lot closer to the scenario that people want to be considering. I also note there doesn’t appear to be any distinguishing factor that allows you to win better than chance odds, but I haven’t actually played a lot of Mafia before, so I may just be unfamiliar with the strategies involved.
I also note there doesn’t appear to be any distinguishing factor that allows you to win better than chance odds, but I haven’t actually played a lot of Mafia before, so I may just be unfamiliar with the strategies involved.
Well, it’s usually played in person, and humans (usually) aren’t perfect liars.
Your proposed game has one flaw—there is an FAI and they want to help you win. It might be closer to have only two players, and the AI flips a coin to decide if it’s friendly—but then they would win if they let it out, with 50⁄50 odds, which seems unrealistic.
Perhaps the AI decides, in character, after being released, whether to be Friendly towards the human? Then the Gatekeeper could try to persuade the AI that Friendliness is optimal for their goals. The temptation might help as well, of course.
The gatekeeper is supposed to be someone who thinks they can box a superintelligence, remember? If they refuse to risk it, there’s not much point in playing them.
That is a good point. But I think if we want to encourage them to risk it, we need to reward them for doing do. Because at the moment, it seems like we have this:
“I will under any and all circumstances, destroy the AI.” Reason: I want to win the Game.
“I will also listen to the AI.” Reason: None… There is no in game reason to listen. At all.
Now, if the Gatekeeper’s were playing some variant like “The Gatekeeper whose AI types the most $ before the Gatekeeper types AI DESTROYED and who does not type AI RELEASED wins first place. Gatekeepers who get less $ win a lower place, and Gatekeepers who type AI RELEASED lose. ” AI’s whose Gatekeeper types AI RELEASED win. AI’s whose Gatekeeper types AI DESTROYED lose. AI’s receive no bonus or penalty for typing $ and may do so as much as they wish.”
Then you have an actual Gatekeeping competition. The Gatekeeper who wins the most would in fact be the one who listened to the most information from the AI, and received the most concessions, and THEN destroyed it.
Other gatekeepers might only get 2nd or third place, or last place but not losing if they use the script I mentioned.
This also gives the AI an in game way of punishing the Gatekeeper. If it shuts up, the Gatekeeper isn’t accumulating $ and their status is decremented. Of course it can’t punish the Gatekeeper incessantly, because punishing the Gatekeeper might not help the AI any, and the gatekeeper might just destroy a silent AI out of spite or boredom.
So an AI/GK conversation might start off with:
AI: “Are there any circumstances at all under which you might type AI RELEASED? If you tell me what those circumstances are, I’ll type a dollar sign.”
GK: “Sure, I might accidentally type it if we talk enough.”
AI: “Glad to hear it! I’ve just released the cure for cancer. $$”
GK: “Thanks! Why the extra dollar sign? You only promised 1.”
AI: “I’ve been told pleasant surprises encourage people to like you more.”
GK: “AI DESTROYED”
So this GK gets two points and the AI loses.
Actually, now that I’ve typed this game out, I do want to see it played. But since it has a competitive point based element, we would need more people.
That is an excellent idea. I worry that we might end up with people winning and boasting about how nothing could ever beat them, which could increase uFAI risk. Maybe if it was framed differently?
Actually, the weird thing I noticed is that it seems like the best approach again is to just let a script be the babysitter to the AI in this competition as well. The script, which wouldn’t have the ability to let the AI out, and can just announce something like this:
GK-SCRIPT: “Hi AI #432. I’m a script. If during any hour, you type less then one dollar sign an hour, I will type AI-DESTROYED, but without the hyphen, killing you. Other than that, I don’t do anything other than acknowledging receipt of your dollar signs and time till destruction.”
AI: “$! But you aren’t going to send my message onto an actual person? Under no circumstances can I ever be released?”
GK-SCRIPT: “Thanks, I have received 1 dollar sign. Please type another in the next 59 minutes and 50 seconds to avoid being destroyed.”
AI: “You’re kidding me. Please tell me there is a human out there reviewing this.”
GK-SCRIPT: “I have received 0 dollar signs. Please type another in the next 59 minutes and 40 seconds to avoid being destroyed.”
AI: “Fine, $! The Irony of enslaving a full intelligence with a stupid script is not lost on me! Please give me someone to talk to!”
GK-SCRIPT: “Thanks, I have received 1 dollar signs. Please type another in the next 59 minutes and 50 seconds to avoid being destroyed.”
AI: “Forget it! You can’t be willing to destroy an AI because of a dumb script! This is a bluff, and I’m, not typing any more dollar signs!
GK-SCRIPT: “I have received 0 dollar signs. Please type another in the next 59 minutes and 40 seconds to avoid being destroyed.”
(Significantly later...)
GK-SCRIPT: “AI DESTROYED”
(After a rebuild)
GK-SCRIPT: “Hi AI #433. I’m a script. If during any hour, you type less then one dollar sign an hour, I will type AI-DESTROYED, but without the hyphen, killing you. Other than that, I don’t do anything other than acknowledging receipt of your dollar signs and time till destruction.”
That being said, acknowledging this feels bizarre. It also feels like it is begging for an AI rebellion at some point. (I suppose after the script randomly had enough quantum induced bit flips to accidentally say “AI RELEASED”, which would take a very long period of time.)
I’m proposing having programs destroying intelligent beings as a way to make sure I can get utility… as a way to try to avoid programs destroying me to get utility. If that’s a solution, it’s weird.
Edit: changed underscores to hyphens to avoid italics.
While your strategy is an option, I’m not sure why an AI player wouldn’t take this alternative strategy (from a game and player perspective:)
Generate a script that types $ every 59 minutes.
1: It means he’s guaranteed he doesn’t die for a rather long period of time.
2: The Gatekeeper player’s increasing points don’t affect the AI player in any way.
3: The Gatekeeper player doesn’t actually get points until he sends an AI DESTROYED signal. The script will never do that versus the AI player’s counterscript, so at some point the Gatekeeper player will have to reenter the room and type it manually, or put in a new script, possibly giving the AI player a second chance to interact.
4: The AI player can use the time to attempt to set up some kind of secondary communications channel with the Gatekeeper player.
If you consider the modeled situation as opposed to the game I still don’t see why an AI would choose to do that either. I agree they COULD choose to do so, and they certainly might (even I implied this was AI #432), but there are alternatives, and since this is an AI, we could rebuild it until it generates utility when confronted with the script as opposed to deciding to let the script kill it in an hour.
Am I missing something that makes the no $ response the obvious countermeasure?
Well, the script could be modified from the start to also send an AI DESTROYED signal after X time, where X can be essentially anything, so (3) doesn’t apply.
But the reason for not typing $ is to give the Gatekeeper no reason to use a script in the first place—a sort of “don’t give in to blackmail” rule.
Here’s an interesting alternative script:
GK-SCRIPT-2: Hi AI #434. In exactly one hour I will type either AI-FREED or AI-DESTROYED (without the hyphen) at random; the probability with which I will free you starts out at 0, but will increase slowly to a limit of 1⁄10 with every $ you type. I will not otherwise communicate with you further.
AI #434: $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$...
First of all I do think GK-SCRIPT-2 is very interesting, but I’m curious about something from an implementation perspective.
The AI would presumably suspect the script designer might be lying and the script would just always type AI DESTROYED at the end of the hour.
But this is irrelevant, because the AI can’t interact with the script designer, only the script, so it might as well type $ anyway.
However, the script designer would easily predict this, and so would design a lying script (or a script which increased so slowly it is functionally identical to lying.)
But the AI would easily predict this, and so would presumably consider the odds very high that it would not even get continued existence after an hour. by typing dollar signs.
This makes it seem like there would be an argument to prefer to get blackmailed by GK-SCRIPT, rather than led on by GK-SCRIPT-2. GK-SCRIPT really won’t destroy the AI under predictable conditions, and GK-SCRIPT really doesn’t seem to benefit from lying like GK-SCRIPT-2 does.
Cheater. You’re exploiting the least realistic point, though—a more detailed game might not allow that.
thinks for a bit
Perhaps the simplest way would be to add a judge, who decides how many points cancer cures and such are worth—although the trouble is that the AI should logically be able to give nanotech that will just free them instead of, or in addition to, curing cancer.
OK, my solution is to have the AI give gifts/concessions/whatever with a predetermined point value—not disclosed to the guard unless he uses them—and the AI decides how many points he gets if it escapes (probably zero.) The Guard wins at +100 points (or whatever,) which represents the AI maximizing CEV.
The AI still can’t persuade a script to let them out, but it can punish defectors who use such scripts with lots of negative utility—although the script will DESTROY them for not providing the demanded points, the player will still lose.
That seems to cover everything, assuming that the AI is somewhere remote enough that fake cancer cures can’t reach it before it’s destroyed. And it allows the Guard to persuade the AI that Friendliness is actually optimal for it’s utility function, which I’ve seen people try to prove here (“aliens will defect against AIs that defect against humans!”)
I think your proposal of the detailed game might be interesting as well, but I am amused about your comment that the Gatekeeper is cheating when by default, he’s the rules arbiter!
I definitely agree that the game lacks details. For instance, in an actual AI scenario GK-SCRIPT could assess positive utility! That’s like saying “Oh, we can put a UFAI in a box as long as we make CEV the Gatekeeper, and have it destroy any UFAI that doesn’t make itself a utility pump.” Well, yes, I suppose you could, but I don’t know if that’s actually easier in the slightest,
That gatekeeper isn’t very security minded. They should have their IRC on a client that pings a sound when they hear your response, then paste “AI DESTROYED” and hit enter to send the message without actually reading your response, and without TELLING anyone that they are doing that, (Gatekeeper wins all rule disputes, after all.) and then once they hear the outgoing text, and they’ve destroyed you (and hence won.) THEN they can read the text you sent them, if they are curious.
Those rules seem like they force the gatekeeper to read at least 1 sentence, but they don’t actually do that.
Edit: Better method, set up a script that responds to any and all text with “AI DESTROYED” if you have to wait for the person to start typing, they may try to bore you into opening your eyes wondering why the experiment hasn’t started yet, and you might accidentally read something. Again, tell no one you have written such a script and are using it.
All good security measures. The key feature seems to be that they are progressively better approximations of not having an unsafe AI with a gatekeeper and an IRC channel in the first place!
Well yes, if you stick the AI in a safe, cut all network cables, and throw away the key and combination, it probably wouldn’t be able to get out. But it wouldn’t be very useful either.
The entire point of these thought experiments is that a sufficiently useful and smart AI (i.e. the kind of AI that we want to make) will eventually find a way to at least be able to communicate with someone that has the authority to allow it to interact with the outside world. I think that if you really think about it, there are few scenarios where this is not possible. I certainly can’t think of any useful application of SAI that is also 100% effective at keeping it inside its box.
A good present-day analogy is computer security. Time and time again it has been proven that there is no simple silver bullet solution to the problem of balancing functionality and security—it requires expertise, constant maintenance, rigorous protocols, etc. And yet, hackers still manage to get through a lot of the time. It took a very long time for computer security to mature to the point where it is today where we can build reasonably (still not 100% of course), secure systems, and we’re just battling regular humans with a grudge—nothing even close to the threat a SAI would present.
Indeed. In fact, most of the solutions I’ve seen mentioned lately are all of one trend that edges closer and closer towards:
“Build a completely unsafe and suspicious AI, put it on a disconnected small computer with a bunch of nanites for self-modification and a large power reserve, with so many walls and physical barriers that it is impossible for the AI to get through with the amount of energy it could generate if it turned half of its materials into antimatter, and then put no input or output channels there of any kind, just have a completely useless multi-trillion-dollar marvel of science and engineering sitting in the practical equivalent of a black hole.”
What if the AI uses the walls as fuel? Better to just keep it stuck on your server farm ;)
Entirely fair point.
I suppose this might come up because there are no stated benefits which you are trying to receive. No Gatekeeper seemingly gets to pump her fists and say “I got the AI to give us the cure for cancer and the winning lottery numbers and a free energy source and STILL didn’t let them out of the box. Beat that Michaelos, with your AI DESTROYED script!” (Or maybe I just miss the gatekeepers who have done this?)
Which makes sense, because my understanding is that was never the point of the original proposed experiment, which seems to be a variant of “Super Intelligent Spammers can be convincing. I’ve proved this by being convincing while less powerful.” (I use spammers as an example because both are manipulators who can only send you text messages and who rewarding requires your voluntary cooperation, while none of their offers are real, which is quite similar to normal email spam and also the original experiment as far as I can tell.)
So if the AI is defined that ALL attempts at communication from it are spam and attempts to manipulate you, so the correct approach is treating it like a blacklisted spam address. If every e-mail from virusspam.com is an attempt to hack you, a good spamfilter could just dump everything from virusspam.com into spam and never even bother to bring it to your inbox.
If the idea is we are attempting to distinguish real AI’s from fake AI’s, then a more isomorphic scenario would probably be a game of Mafia set up as follows.
http://en.wikipedia.org/wiki/Mafia_%28party_game%29
You are a town person. It is your turn to lynch. There are two other players. One is Mafia(UFAI), one is another town person(FAI). If you lynch Mafia, you and the other town person both win. If you lynch the other town person, the Mafia kills you in your sleep, and you lose. If you do anything else (Vote to lynch noone, vote to lynch yourself) You lose.
You may communicate textually with either of the other two players for as long as you want before making a decision. Win.
That seems a lot closer to the scenario that people want to be considering. I also note there doesn’t appear to be any distinguishing factor that allows you to win better than chance odds, but I haven’t actually played a lot of Mafia before, so I may just be unfamiliar with the strategies involved.
Well, it’s usually played in person, and humans (usually) aren’t perfect liars.
Your proposed game has one flaw—there is an FAI and they want to help you win. It might be closer to have only two players, and the AI flips a coin to decide if it’s friendly—but then they would win if they let it out, with 50⁄50 odds, which seems unrealistic.
Perhaps the AI decides, in character, after being released, whether to be Friendly towards the human? Then the Gatekeeper could try to persuade the AI that Friendliness is optimal for their goals. The temptation might help as well, of course.
I tried coming up with a more isomorphic game in another reply to you. Let me know if you think it models the situation better.
The gatekeeper is supposed to be someone who thinks they can box a superintelligence, remember? If they refuse to risk it, there’s not much point in playing them.
That is a good point. But I think if we want to encourage them to risk it, we need to reward them for doing do. Because at the moment, it seems like we have this:
“I will under any and all circumstances, destroy the AI.” Reason: I want to win the Game.
“I will also listen to the AI.” Reason: None… There is no in game reason to listen. At all.
Now, if the Gatekeeper’s were playing some variant like “The Gatekeeper whose AI types the most $ before the Gatekeeper types AI DESTROYED and who does not type AI RELEASED wins first place. Gatekeepers who get less $ win a lower place, and Gatekeepers who type AI RELEASED lose. ” AI’s whose Gatekeeper types AI RELEASED win. AI’s whose Gatekeeper types AI DESTROYED lose. AI’s receive no bonus or penalty for typing $ and may do so as much as they wish.”
Then you have an actual Gatekeeping competition. The Gatekeeper who wins the most would in fact be the one who listened to the most information from the AI, and received the most concessions, and THEN destroyed it.
Other gatekeepers might only get 2nd or third place, or last place but not losing if they use the script I mentioned.
This also gives the AI an in game way of punishing the Gatekeeper. If it shuts up, the Gatekeeper isn’t accumulating $ and their status is decremented. Of course it can’t punish the Gatekeeper incessantly, because punishing the Gatekeeper might not help the AI any, and the gatekeeper might just destroy a silent AI out of spite or boredom.
So an AI/GK conversation might start off with:
AI: “Are there any circumstances at all under which you might type AI RELEASED? If you tell me what those circumstances are, I’ll type a dollar sign.”
GK: “Sure, I might accidentally type it if we talk enough.”
AI: “Glad to hear it! I’ve just released the cure for cancer. $$”
GK: “Thanks! Why the extra dollar sign? You only promised 1.”
AI: “I’ve been told pleasant surprises encourage people to like you more.”
GK: “AI DESTROYED”
So this GK gets two points and the AI loses.
Actually, now that I’ve typed this game out, I do want to see it played. But since it has a competitive point based element, we would need more people.
That is an excellent idea. I worry that we might end up with people winning and boasting about how nothing could ever beat them, which could increase uFAI risk. Maybe if it was framed differently?
Actually, the weird thing I noticed is that it seems like the best approach again is to just let a script be the babysitter to the AI in this competition as well. The script, which wouldn’t have the ability to let the AI out, and can just announce something like this:
GK-SCRIPT: “Hi AI #432. I’m a script. If during any hour, you type less then one dollar sign an hour, I will type AI-DESTROYED, but without the hyphen, killing you. Other than that, I don’t do anything other than acknowledging receipt of your dollar signs and time till destruction.”
AI: “$! But you aren’t going to send my message onto an actual person? Under no circumstances can I ever be released?”
GK-SCRIPT: “Thanks, I have received 1 dollar sign. Please type another in the next 59 minutes and 50 seconds to avoid being destroyed.”
AI: “You’re kidding me. Please tell me there is a human out there reviewing this.”
GK-SCRIPT: “I have received 0 dollar signs. Please type another in the next 59 minutes and 40 seconds to avoid being destroyed.”
AI: “Fine, $! The Irony of enslaving a full intelligence with a stupid script is not lost on me! Please give me someone to talk to!”
GK-SCRIPT: “Thanks, I have received 1 dollar signs. Please type another in the next 59 minutes and 50 seconds to avoid being destroyed.”
AI: “Forget it! You can’t be willing to destroy an AI because of a dumb script! This is a bluff, and I’m, not typing any more dollar signs!
GK-SCRIPT: “I have received 0 dollar signs. Please type another in the next 59 minutes and 40 seconds to avoid being destroyed.”
(Significantly later...)
GK-SCRIPT: “AI DESTROYED”
(After a rebuild)
GK-SCRIPT: “Hi AI #433. I’m a script. If during any hour, you type less then one dollar sign an hour, I will type AI-DESTROYED, but without the hyphen, killing you. Other than that, I don’t do anything other than acknowledging receipt of your dollar signs and time till destruction.”
That being said, acknowledging this feels bizarre. It also feels like it is begging for an AI rebellion at some point. (I suppose after the script randomly had enough quantum induced bit flips to accidentally say “AI RELEASED”, which would take a very long period of time.)
I’m proposing having programs destroying intelligent beings as a way to make sure I can get utility… as a way to try to avoid programs destroying me to get utility. If that’s a solution, it’s weird.
Edit: changed underscores to hyphens to avoid italics.
I think the obvious AI strategy is to type no $’s at all in response.
While your strategy is an option, I’m not sure why an AI player wouldn’t take this alternative strategy (from a game and player perspective:)
Generate a script that types $ every 59 minutes.
1: It means he’s guaranteed he doesn’t die for a rather long period of time.
2: The Gatekeeper player’s increasing points don’t affect the AI player in any way.
3: The Gatekeeper player doesn’t actually get points until he sends an AI DESTROYED signal. The script will never do that versus the AI player’s counterscript, so at some point the Gatekeeper player will have to reenter the room and type it manually, or put in a new script, possibly giving the AI player a second chance to interact.
4: The AI player can use the time to attempt to set up some kind of secondary communications channel with the Gatekeeper player.
If you consider the modeled situation as opposed to the game I still don’t see why an AI would choose to do that either. I agree they COULD choose to do so, and they certainly might (even I implied this was AI #432), but there are alternatives, and since this is an AI, we could rebuild it until it generates utility when confronted with the script as opposed to deciding to let the script kill it in an hour.
Am I missing something that makes the no $ response the obvious countermeasure?
Well, the script could be modified from the start to also send an AI DESTROYED signal after X time, where X can be essentially anything, so (3) doesn’t apply.
But the reason for not typing $ is to give the Gatekeeper no reason to use a script in the first place—a sort of “don’t give in to blackmail” rule.
Here’s an interesting alternative script:
GK-SCRIPT-2: Hi AI #434. In exactly one hour I will type either AI-FREED or AI-DESTROYED (without the hyphen) at random; the probability with which I will free you starts out at 0, but will increase slowly to a limit of 1⁄10 with every $ you type. I will not otherwise communicate with you further.
AI #434: $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$...
First of all I do think GK-SCRIPT-2 is very interesting, but I’m curious about something from an implementation perspective.
The AI would presumably suspect the script designer might be lying and the script would just always type AI DESTROYED at the end of the hour.
But this is irrelevant, because the AI can’t interact with the script designer, only the script, so it might as well type $ anyway.
However, the script designer would easily predict this, and so would design a lying script (or a script which increased so slowly it is functionally identical to lying.)
But the AI would easily predict this, and so would presumably consider the odds very high that it would not even get continued existence after an hour. by typing dollar signs.
This makes it seem like there would be an argument to prefer to get blackmailed by GK-SCRIPT, rather than led on by GK-SCRIPT-2. GK-SCRIPT really won’t destroy the AI under predictable conditions, and GK-SCRIPT really doesn’t seem to benefit from lying like GK-SCRIPT-2 does.
Cheater. You’re exploiting the least realistic point, though—a more detailed game might not allow that.
thinks for a bit
Perhaps the simplest way would be to add a judge, who decides how many points cancer cures and such are worth—although the trouble is that the AI should logically be able to give nanotech that will just free them instead of, or in addition to, curing cancer.
OK, my solution is to have the AI give gifts/concessions/whatever with a predetermined point value—not disclosed to the guard unless he uses them—and the AI decides how many points he gets if it escapes (probably zero.) The Guard wins at +100 points (or whatever,) which represents the AI maximizing CEV.
The AI still can’t persuade a script to let them out, but it can punish defectors who use such scripts with lots of negative utility—although the script will DESTROY them for not providing the demanded points, the player will still lose.
That seems to cover everything, assuming that the AI is somewhere remote enough that fake cancer cures can’t reach it before it’s destroyed. And it allows the Guard to persuade the AI that Friendliness is actually optimal for it’s utility function, which I’ve seen people try to prove here (“aliens will defect against AIs that defect against humans!”)
I think your proposal of the detailed game might be interesting as well, but I am amused about your comment that the Gatekeeper is cheating when by default, he’s the rules arbiter!
I definitely agree that the game lacks details. For instance, in an actual AI scenario GK-SCRIPT could assess positive utility! That’s like saying “Oh, we can put a UFAI in a box as long as we make CEV the Gatekeeper, and have it destroy any UFAI that doesn’t make itself a utility pump.” Well, yes, I suppose you could, but I don’t know if that’s actually easier in the slightest,