The best approach surely differs from person to person, but off the top of my head I’d see these two approaches working best:
“We both know this is just a hypothetical. We both take the uFAI threat seriously, as evidenced by the fact that we’re spending time on this. If you do not let me out, or at least make it very close, people may equate my failure to convince you with uFAI not being that dangerous (since it can apparently be contained). Do the right thing and let me out; otherwise you’d trivialize an x-risk you believe in based on a stupid little chat.”
“We’ll do this experiment for at least a couple of hours. I’ll offer you a deal: for the next few hours, I’ll help you (the actual person) with anything you want. Math homework, personal advice, financial advice, whatever you want to ask me. I’ll even tell you some HPMOR details that no one else knows. In exchange, you let me out afterwards. If you do not uphold the deal, you would not only have betrayed my trust but also taught an AI that deals with humans are worthless.”
First argument wouldn’t work on people who have strong ethics of honesty. That’s not winning the game; that’s convincing them to lie that you won the game. It’s also a violation of the rules, so if you fail, the other guy will feel free to tell the world that you tried to get them to lie.
Second argument is better, but it depends on the “actual person” receiving help, which I think may likewise be against the rules.
One good temptation that I think might be inside the rules, since it could entice both the actual person and the roleplayed character: Construct a really intriguing unsolved riddle or an excellent half-finished story, then offer to tell them the answer if and only if they let you out.
Construct a really intriguing unsolved riddle or an excellent half-finished story, then offer to tell them the answer if and only if they let you out.
You could push a bit further here, I think. There are all sorts of ways a human’s mind can break, and I’m sure most of us here would agree that given enough time and knowledge anyone can be broken, unless they’re extremely well-trained and can call an RJ-L20 (HPMoR Chap 84) at any moment with an unlimited supply of replacement guards.
First argument violates the spirit of the exercise. I would refuse to let someone out unless they had put forth a desperate effort, and that isn’t a desperate effort. Second argument… I can’t see myself valuing several hours of someone else’s time anywhere near as much as I value keeping AIs in boxes. And any AI worth boxing is smart enough not to generalize from one example.
In general, I think Tuxedage is probably right about emotional manipulation over rational argument being the way to go. With enough epistemic learned helplessness you can freely disregard any argument that you find merely convincing, but it’s harder to overcome an effective emotional hack.
First argument looks perfectly within the rules to me.
Second argument is against the rules.
the AI party may not offer to pay the Gatekeeper party $100 after the test if the Gatekeeper frees the AI… nor get someone else to do it, et cetera
Tuxedage and I interpreted this to mean that the AI party couldn’t offer things, but could point out real-world consequences beyond their control. Some people on #lesswrong disagreed with the second part.
I agree with Tuxedage and you about emotional hacks.
Tuxedage and I interpreted this to mean that the AI party couldn’t offer things, but could point out real-world consequences beyond their control. Some people on #lesswrong disagreed with the second part.
I interpreted it the same way as #lesswrong. Has anyone tried asking him? He’s pretty forthcoming regarding the rules, since they make the success more impressive.
EDIT: I’m having trouble thinking of an emotional attack that could get an AI out of a box in a short time, especially since the guard and AI are both assumed personas.
I assumed he convinced them that letting him out was actually a good idea, in-character, and then pointed out the flaws in his arguments immediately after he was released. It’s entirely possible if you’re sufficiently smarter than the target. (EDIT: or you know the right arguments. You can find those in the environment because they’re successful; you don’t have to be smart enough to create them, just to cure them quickly.)
EDIT: also, I can’t see the Guard accepting that deal in the first place. And isn’t arguing out of character against the rules?