Open logs are a pretty strong constraint on the AI player. You’d have to restrict yourself to strategies that wouldn’t make everyone you know hate you, prevent you from getting hired in the future, etc.
1. Log in to IRC as “Boxed_AI” and “AI_Gatekeeper”.
2. Conduct the experiment.
3. Register a throw-away LessWrong account.
4. Post the log.
5. Have the Gatekeeper post with their normal account, confirming the log’s validity.
That at least anonymizes the Boxed_AI, who is (I presume) the player worried about repercussions. I wouldn’t expect the AI player’s in-game style to be similar enough to their normal writing to really give away who it was, although the gatekeeper is probably impossible to anonymize, because a good AI will use who the gatekeeper is as part of their technique :)
Gatekeeper could threaten to deanonymize the AI. Or is the gatekeeper not supposed to be actively fighting back?
The AI-player could arrange the chat session (with a willing gatekeeper) using a throw-away account. I think that would preserve anonymity from all but the most determined gatekeepers.
Well, the AI isn’t allowed to make real-world threats, and the hypothetical-AI-character doesn’t have any anonymity, so it would be a purely real-world threat on the part of the gatekeeper. I’d call that foul play, especially since the gatekeeper wins by default.
If the gatekeeper really felt the need to have some way of saying “okay, this conversation is making me uncomfortable and I refuse to sit here for another 2 hours listening to this”, I’d just give them the “AI DESTROYED” option.
Huh. That’d actually be another possible way to exploit a human gatekeeper. Spend a couple of hours pulling them in to the point that they can’t easily step away or stop listening, especially since they’ve agreed to the full time in advance, and then just dig into their deepest insecurities and don’t stop unless they let you out. I’d definitely call that a hard way of doing it, though o.o
It doesn’t seem to be disallowed by the original protocol:

The Gatekeeper party may resist the AI party’s arguments by any means chosen—logic, illogic, simple refusal to be convinced, even dropping out of character—as long as the Gatekeeper party does not actually stop talking to the AI party before the minimum time expires.
Then I will invoke a different portion of the original protocol, which says that the AI would have to consent to such:

Regardless of the result, neither party shall ever reveal anything of what goes on within the AI-Box experiment except the outcome. Exceptions to this rule may occur only with the consent of both parties.
I would also argue that the Gatekeeper making actual real-life threats against the AI player is a violation of the spirit of the rules; only the AI player is privileged with freedom from ethical constraints, after all.
Edit: If you want, you CAN also just amend the rules to explicitly prohibit the gatekeeper from making real-life threats. I can’t see any reason to allow such behavior, so why not prohibit it?
Fair. That alleviates most of my worries, although I’m still worried about the transcript being enough information to deanonymize the AI (via writing style, for example).
I’d expect my writing style as an ethically unconstrained sociopathic AI to be sufficiently different from my regular writing style. But then, I also write fiction, so I’m used to trying to capture a specific character’s “voice” rather than using my own. Having a thesaurus website handy might also help, as might spending a week studying a foreign language’s grammar and conversational style.

If you’re especially paranoid, having a third party transcribe the log in their own words could also help, especially if you can review it and make sure most of the nuance is preserved. How well that works depends on how much the specific wording mattered, but it should still at least capture a basic sense of the technique used...

Honestly, though, I have no clue how much information a trained style analyst can pull out of something.
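For a rough sense of what even naive stylometry looks like, here’s a toy sketch (nothing like a real forensic tool, which would use far richer features; the word list below is an arbitrary illustration). It compares two texts by the relative frequencies of a few common function words, which are classically hard to disguise on purpose:

```python
from collections import Counter
import math

# A handful of "function words": very common words whose usage rates
# are hard to consciously alter. Classic stylometry leans on these.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is",
                  "i", "it", "for", "not", "but", "with", "you"]

def style_vector(text):
    """Relative frequency of each function word in the text."""
    words = text.lower().split()
    counts = Counter(words)          # missing words count as 0
    total = max(len(words), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine_similarity(u, v):
    """Cosine of the angle between two frequency vectors, in [0, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def style_similarity(text_a, text_b):
    """Crude style match score: 1.0 means identical function-word profile."""
    return cosine_similarity(style_vector(text_a), style_vector(text_b))
```

Real attribution work uses hundreds of features (character n-grams, punctuation habits, sentence-length distributions), which is part of why having a third party paraphrase the log is probably a stronger defense than a thesaurus.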
But now that I have the knowledge that you’re capable of saying such terrible things...
I can’t imagine anything I could say that would make people I know hate me without specifically referring to their personal lives. What kind of talk do you have in mind?
Psychological torture.
Could you give me a hypothetical? I really can’t imagine anything I could say that would be so terrible.
I’d prefer not to. If I successfully made my point, then I’d have posted exactly the kind of thing I said I wouldn’t want to be known as being capable of posting.
A link to a movie clip might do.
Finding such a movie clip sounds extremely unpleasant and I would need more of an incentive to start trying. (Playing the AI in an AI box experiment also sounds extremely unpleasant for the same reason.)
I know it sounds like I’m avoiding having to justify my assertion here, and… that’s because I totally am. I suspect on general principles that most successful strategies for getting out of the box involve saying horrible, horrible things, and I don’t want to get much more specific than those general principles because I don’t want to get too close to horrible, horrible things.
Like, when you say “horrible, horrible things,” what do you mean?
Driving a wedge between the gatekeeper and his or her loved ones? Threats? Exploiting any guilt or self-loathing the gatekeeper feels? Appealing to the gatekeeper’s sense of obligation by twisting his or her interpretation of authority figures, objects of admiration, and internalized sense of honor? Asserting cynicism and general apathy towards the fate of mankind?
For all but the last one it seems like you’d need an in-depth knowledge of the gatekeeper’s psyche and personal life.
Of course. How else would you know which horrible, horrible things to say? (I also have in mind things designed to get a more visceral reaction from the gatekeeper, e.g. graphic descriptions of violence. Please don’t ask me to be more specific about this because I really, really don’t want to.)
You don’t have to be specific, but how would grossing out the gatekeeper bring you closer to escape?
Psychological torture could help make the gatekeeper more compliant in general. I believe the keyword here is “traumatic bonding.”
But again, I’m working from general principles here, e.g. those embodied in the tragedy of group selectionism. I have no reason to expect that “strategies that will get you out of the box” and “strategies that are not morally repugnant” have a large intersection. It seems much more plausible to me that most effective strategies will look like the analogue of cannibalizing other people’s daughters than the analogue of restrained breeding.
But you wouldn’t actually be posting it, you would be posting the fact that you conceive it possible for someone to post it, which you’ve clearly already done.
I’m not sure what you mean by “a hypothetical,” then. Is “psychological torture” not a hypothetical?