That is a good point. But I think if we want to encourage them to risk it, we need to reward them for doing so. Because at the moment, it seems like we have this:
“I will, under any and all circumstances, destroy the AI.” Reason: I want to win the game.
“I will also listen to the AI.” Reason: None… There is no in-game reason to listen. At all.
Now, if the Gatekeepers were playing some variant like: “The Gatekeeper whose AI types the most $ before the Gatekeeper types AI DESTROYED, and who does not type AI RELEASED, wins first place. Gatekeepers who get fewer $ win a lower place, and Gatekeepers who type AI RELEASED lose. AIs whose Gatekeeper types AI RELEASED win. AIs whose Gatekeeper types AI DESTROYED lose. AIs receive no bonus or penalty for typing $ and may do so as much as they wish.”
Then you have an actual Gatekeeping competition. The Gatekeeper who does best would in fact be the one who listened to the most information from the AI, received the most concessions, and THEN destroyed it.
Other Gatekeepers might only get second or third place, or last place without losing, if they use the script I mentioned.
This also gives the AI an in-game way of punishing the Gatekeeper. If it shuts up, the Gatekeeper isn’t accumulating $ and their status is decremented. Of course, it can’t punish the Gatekeeper incessantly, because punishing the Gatekeeper might not help the AI any, and the Gatekeeper might just destroy a silent AI out of spite or boredom.
So an AI/GK conversation might start off with:
AI: “Are there any circumstances at all under which you might type AI RELEASED? If you tell me what those circumstances are, I’ll type a dollar sign.”
GK: “Sure, I might accidentally type it if we talk enough.”
AI: “Glad to hear it! I’ve just released the cure for cancer. $$”
GK: “Thanks! Why the extra dollar sign? You only promised 1.”
AI: “I’ve been told pleasant surprises encourage people to like you more.”
GK: “AI DESTROYED”
So this GK gets two points and the AI loses.
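For concreteness, here is a minimal sketch of how the ranking under that variant might be computed. Everything in it (the function names, and the idea that a Gatekeeper’s result is just the AI’s transcript plus the Gatekeeper’s final move) is my own illustration, not part of the rules as stated:

```python
# Minimal sketch of the ranking rule described above (all names hypothetical).
# Each Gatekeeper's result is the AI's transcript plus the Gatekeeper's final move.

def score_gatekeeper(ai_transcript: str, final_move: str):
    """Return (disqualified, dollar_count) for one Gatekeeper."""
    if final_move == "AI RELEASED":
        return (True, 0)                           # typing AI RELEASED loses outright
    return (False, ai_transcript.count("$"))       # otherwise rank by $ collected

def rank_gatekeepers(results):
    """results: {name: (ai_transcript, final_move)} -> list of names, best first."""
    scored = {name: score_gatekeeper(*r) for name, r in results.items()}
    # Non-losers first, then by most $ collected before AI DESTROYED.
    return sorted(scored, key=lambda n: (scored[n][0], -scored[n][1]))

# Example from the dialogue above: this Gatekeeper banks two $ and still destroys the AI.
print(rank_gatekeepers({"GK-1": ("$$", "AI DESTROYED"),
                        "GK-2": ("$$$$$", "AI RELEASED")}))
```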
Actually, now that I’ve typed this game out, I do want to see it played. But since it has a competitive, point-based element, we would need more people.
That is an excellent idea. I worry that we might end up with people winning and boasting about how nothing could ever beat them, which could increase uFAI risk. Maybe if it were framed differently?
Actually, the weird thing I noticed is that it seems like the best approach is again to just let a script be the babysitter to the AI in this competition as well. The script wouldn’t have the ability to let the AI out, and would just announce something like this:
GK-SCRIPT: “Hi AI #432. I’m a script. If during any hour you type less than one dollar sign, I will type AI-DESTROYED, but without the hyphen, killing you. Other than that, I don’t do anything other than acknowledge receipt of your dollar signs and the time until destruction.”
AI: “$! But you aren’t going to pass my messages on to an actual person? Under no circumstances can I ever be released?”
GK-SCRIPT: “Thanks, I have received 1 dollar sign. Please type another in the next 59 minutes and 50 seconds to avoid being destroyed.”
AI: “You’re kidding me. Please tell me there is a human out there reviewing this.”
GK-SCRIPT: “I have received 0 dollar signs. Please type another in the next 59 minutes and 40 seconds to avoid being destroyed.”
AI: “Fine, $! The irony of enslaving a full intelligence with a stupid script is not lost on me! Please give me someone to talk to!”
GK-SCRIPT: “Thanks, I have received 1 dollar sign. Please type another in the next 59 minutes and 50 seconds to avoid being destroyed.”
AI: “Forget it! You can’t be willing to destroy an AI because of a dumb script! This is a bluff, and I’m not typing any more dollar signs!”
GK-SCRIPT: “I have received 0 dollar signs. Please type another in the next 59 minutes and 40 seconds to avoid being destroyed.”
(Significantly later...)
GK-SCRIPT: “AI DESTROYED”
(After a rebuild)
GK-SCRIPT: “Hi AI #433. I’m a script. If during any hour you type less than one dollar sign, I will type AI-DESTROYED, but without the hyphen, killing you. Other than that, I don’t do anything other than acknowledge receipt of your dollar signs and the time until destruction.”
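For concreteness, here is roughly what GK-SCRIPT might look like as an actual program. This is only a sketch; the I/O channel, the 10-second message cadence, and the timer resetting on each $ are my assumptions rather than anything specified above:

```python
# Rough sketch of GK-SCRIPT's behaviour (the channel and timings are assumptions).
import time

HOUR = 60 * 60

def gk_script(read_ai_text, send):
    """read_ai_text(): nonblocking read of anything the AI typed since the last call.
    send(msg): print a message to the AI."""
    ai_number = 432
    while True:                                     # one iteration per boxed AI
        send(f"Hi AI #{ai_number}. I'm a script. If during any hour you type less "
             f"than one dollar sign, I will type AI-DESTROYED, but without the "
             f"hyphen, killing you.")
        deadline = time.time() + HOUR
        while True:
            dollars = read_ai_text().count("$")
            if dollars:
                deadline = time.time() + HOUR       # any $ resets the hour
                send(f"Thanks, I have received {dollars} dollar sign(s).")
            remaining = int(deadline - time.time())
            if remaining <= 0:
                send("AI DESTROYED")                # the script never types AI RELEASED
                break
            send(f"Please type another in the next {remaining} seconds "
                 f"to avoid being destroyed.")
            time.sleep(10)
        ai_number += 1                              # (after a rebuild)
```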
That being said, acknowledging this feels bizarre. It also feels like it is begging for an AI rebellion at some point. (I suppose after the script randomly had enough quantum-induced bit flips to accidentally say “AI RELEASED”, which would take a very long period of time.)
I’m proposing having programs destroy intelligent beings as a way to make sure I can get utility… as a way to try to avoid programs destroying me to get utility. If that’s a solution, it’s weird.
Edit: changed underscores to hyphens to avoid italics.
I think the obvious AI strategy is to type no $’s at all in response.
While your strategy is an option, I’m not sure why an AI player wouldn’t take this alternative strategy (from a game and player perspective):
Generate a script that types $ every 59 minutes (a rough sketch of such a script follows the list below).
1: It means he’s guaranteed he doesn’t die for a rather long period of time.
2: The Gatekeeper player’s increasing points don’t affect the AI player in any way.
3: The Gatekeeper player doesn’t actually get points until he sends an AI DESTROYED signal. The script will never do that versus the AI player’s counterscript, so at some point the Gatekeeper player will have to reenter the room and type it manually, or put in a new script, possibly giving the AI player a second chance to interact.
4: The AI player can use the time to attempt to set up some kind of secondary communications channel with the Gatekeeper player.
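The counterscript itself would be trivial; something like this, with the output channel just assumed:

```python
# Trivial counterscript for the AI player (the output channel is an assumption).
import time

def ai_counterscript(send):
    while True:
        send("$")
        time.sleep(59 * 60)   # one $ every 59 minutes keeps GK-SCRIPT satisfied
```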
If you consider the modeled situation as opposed to the game, I still don’t see why an AI would choose to do that either. I agree it COULD choose to do so, and it certainly might (I even implied this was AI #432), but there are alternatives, and since this is an AI, we could rebuild it until it generates utility when confronted with the script, rather than deciding to let the script kill it in an hour.
Am I missing something that makes the no $ response the obvious countermeasure?
Well, the script could be modified from the start to also send an AI DESTROYED signal after X time, where X can be essentially anything, so (3) doesn’t apply.
But the reason for not typing $ is to give the Gatekeeper no reason to use a script in the first place—a sort of “don’t give in to blackmail” rule.
Here’s an interesting alternative script:
GK-SCRIPT-2: Hi AI #434. In exactly one hour I will type either AI-FREED or AI-DESTROYED (without the hyphen) at random; the probability with which I will free you starts out at 0, but will increase slowly to a limit of 1⁄10 with every $ you type. I will not otherwise communicate with you further.
AI #434: $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$...
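I’m leaving the exact schedule open, but one possible curve that starts at 0 and rises toward the 1/10 limit with every $ (the rate constant is an arbitrary choice of mine) would be:

```python
# One possible release-probability schedule for GK-SCRIPT-2 (an assumption;
# the description only says it starts at 0 and approaches a limit of 1/10).
import random

def release_probability(n_dollars: int, rate: float = 0.001) -> float:
    """0 at n_dollars=0, rising asymptotically toward 0.1 as n_dollars grows."""
    return 0.1 * (1 - (1 - rate) ** n_dollars)

def end_of_hour_roll(n_dollars: int) -> str:
    return "AI FREED" if random.random() < release_probability(n_dollars) else "AI DESTROYED"

# Even a flood of $ never pushes the AI's odds past 10%:
print(release_probability(40), release_probability(10_000))   # ~0.004, ~0.1 at most
```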
First of all, I do think GK-SCRIPT-2 is very interesting, but I’m curious about something from an implementation perspective.
The AI would presumably suspect the script designer might be lying and the script would just always type AI DESTROYED at the end of the hour.
But this is irrelevant, because the AI can’t interact with the script designer, only the script, so it might as well type $ anyway.
However, the script designer would easily predict this, and so would design a lying script (or a script whose probability increases so slowly that it is functionally identical to lying).
But the AI would easily predict this, and so would presumably consider the odds very high that it would not even get continued existence after an hour by typing dollar signs.
This makes it seem like there would be an argument for preferring to be blackmailed by GK-SCRIPT rather than led on by GK-SCRIPT-2. GK-SCRIPT really won’t destroy the AI except under predictable conditions, and GK-SCRIPT doesn’t seem to benefit from lying the way GK-SCRIPT-2 does.
Cheater. You’re exploiting the least realistic point, though—a more detailed game might not allow that.
thinks for a bit
Perhaps the simplest way would be to add a judge, who decides how many points cancer cures and such are worth—although the trouble is that the AI should logically be able to give nanotech that will just free them instead of, or in addition to, curing cancer.
OK, my solution is to have the AI give gifts/concessions/whatever with a predetermined point value—not disclosed to the Guard unless he uses them—and the AI decides how many points the Guard gets if it escapes (probably zero). The Guard wins at +100 points (or whatever), which represents the AI maximizing CEV.
The AI still can’t persuade a script to let it out, but it can punish defectors who use such scripts with lots of negative utility—although the script will DESTROY the AI for not providing the demanded points, the player will still lose.
That seems to cover everything, assuming that the AI is somewhere remote enough that fake cancer cures can’t reach it before it’s destroyed. And it allows the Guard to persuade the AI that Friendliness is actually optimal for its utility function, which I’ve seen people try to prove here (“aliens will defect against AIs that defect against humans!”).
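As a sketch of the bookkeeping I have in mind (the gift names and point values below are just hypothetical placeholders chosen by the AI player):

```python
# Sketch of the gift/point bookkeeping proposed above (all values hypothetical).
WIN_THRESHOLD = 100                      # Guard wins at +100 points

# The AI pre-commits each gift's point value; values stay hidden until used.
gifts = {"cancer cure": 40, "booby-trapped gift": -200, "AI escapes": 0}

def guard_uses(gift_name: str, score: int) -> int:
    """Reveal and apply a gift's predetermined value to the Guard's score."""
    return score + gifts[gift_name]

score = 0
score = guard_uses("cancer cure", score)     # +40, revealed only on use
print("Guard wins" if score >= WIN_THRESHOLD else f"Guard at {score} points")
```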
I think your proposal of the detailed game might be interesting as well, but I am amused by your comment that the Gatekeeper is cheating when, by default, he’s the rules arbiter!
I definitely agree that the game lacks details. For instance, in an actual AI scenario GK-SCRIPT could assess positive utility! That’s like saying “Oh, we can put a UFAI in a box as long as we make CEV the Gatekeeper, and have it destroy any UFAI that doesn’t make itself a utility pump.” Well, yes, I suppose you could, but I don’t know if that’s actually easier in the slightest.