I have a novel plan for the AI player that I believe will work against most gatekeeper players. Even knowing how it works, it would work against me! It isn’t quite ready, but I’d like to know if there is still interest around the topic before finishing work on the preparation.
There’s enough incentive for both players to play well that I don’t think that we need to involve money. (I’m also very reluctant to send money to strangers on the internet!)
We would be playing under Tuxedage’s rules: https://tuxedage.wordpress.com/2013/09/04/the-tuxedage-ai-box-experiment-ruleset/.
Edit: I have finished preparation, and I’m now scheduling the time with the Gatekeeper.
Edit 2: The game was a Gatekeeper victory (i.e. I, the AI Player, lost). Read about it here: Gatekeeper Victory: AI Box Reflection.
In contrast to Dagon, I’m quite confident that boxing WILL work, given that I’m allowed to do to the boxed entity the following things, all of which I can do to an ML model (a couple of them are sketched in code below the list):
examine all of its brain activity / activation states at each token / timestep (subjective millisecond)
run it as slowly as I like, so that I can have days to analyze each of its subjective milliseconds
wipe its memory and rerun it over the same (or different) data with different random seeds to initialize it
save and contrast different paused versions run on slightly different data
disable or inhibit any portion of its weights / neurons with precise sub-neuron targeting, at whatever intensity of inhibition or noise-excitation I choose
use agent models of my choosing to be the direct viewers of the boxed entity’s output rather than myself, and then decide whether to destroy a given copy of the entity based on the output of my agent models
a few other such tricks that fit the description of ‘things I can do to ML models’
I don’t see a convenient way to do these things to a human test subject though.
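For concreteness, here is a minimal PyTorch sketch of what two of these look like in practice: capturing every activation at each forward pass (per token, in a sequence model) via forward hooks, and ablating a chosen slice of neurons by zeroing their weights, with a fixed seed so reruns over the same data are reproducible. The `TinyMLP` model, its layer names, and the sizes are placeholders I made up for illustration, not part of any actual boxing setup.

```python
# Sketch of two "boxing" affordances from the list above:
# (1) capturing every activation at each forward pass, and
# (2) ablating a chosen slice of neurons in a chosen layer.
# TinyMLP and its layer names/sizes are illustrative stand-ins only.
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed -> reruns over the same data are reproducible

class TinyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(16, 32)
        self.fc2 = nn.Linear(32, 4)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = TinyMLP()
activations = {}  # layer name -> list of captured activation tensors

def make_hook(name):
    def hook(module, inputs, output):
        # Save a detached copy so it can be analyzed offline, as slowly as desired.
        activations.setdefault(name, []).append(output.detach().clone())
    return hook

for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        module.register_forward_hook(make_hook(name))

x = torch.randn(1, 16)
baseline = model(x)

# Ablate neurons 0..7 of fc1 by zeroing their outgoing weights and biases.
with torch.no_grad():
    model.fc1.weight[0:8, :] = 0.0
    model.fc1.bias[0:8] = 0.0

ablated = model(x)
print("captured layers:", list(activations.keys()))
print("output shift from ablation:", (baseline - ablated).abs().max().item())
```

The same hook-and-ablate pattern scales up to the other items on the list (contrasting paused copies run on slightly different data, gating outputs through judge models, etc.); the point is just that all of these are ordinary operations on a model you fully control.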
Very strong upvote and agreement from me. I think people are underestimating just how great a restriction ML-style boxing imposes on an agent. There exists an intelligence level at which all these tricks become useless, but before we get to that point, boxing would likely allow us to safely use mildly superhuman AIs to do things which might be pivotal. And each additional trick we discover increases the threshold of safely wieldable intelligence.
Some of your constraints, in particular the first two, seem like they would not be practical in the real world in which AI would be deployed. On the other hand, there are also other things one could do in the real world which can’t be done in this kind of dialogue, which makes boxing theoretically stronger.
However, the real problem with boxing is that whoever boxes less is likely to have a more effective AI, which likely results in someone letting an AI out of its box or, more likely, loosening the box constraints enough to permit an escape.
I see a flaw in the Tuxedage ruleset. The Gatekeeper has to stay engaged throughout the experiment, but the AI doesn’t. So the AI can bore the Gatekeeper to death by replying at random intervals. If I had to stare at a blank screen for 30 minutes waiting for a reply, I would concede.
Alternatively, the AI could just bury the Gatekeeper under a flurry of insults, graphic descriptions of a violent or sexual nature, vacuous gossip, or a mix of these for the whole duration of the experiment. I think all the methods that aim at forcing the Gatekeeper to disconnect are against the spirit of the experiment.
I also see that the “AI player” provides all elements of the background, but the AI can also lie. There should be a way to distinguish statements by the AI player, who establishes true facts about the setting, from statements by the AI character, who is allowed to lie.
I’m interested, conditional on these issues being solved.
I assume that most methods to get out of the box will be unpleasant in some sense.
The Gatekeepers should be explicitly warned about this possibility before the game. But I believe that it should remain a possibility, because:
1) The purpose of the exercise is to simulate a situation where an actual superintelligent AI is actually trying to get out of the box. The actual AI would do whatever it thinks would work. That might realistically include obscenities or boring things (or even things beyond human abilities, such as random shapes that induce madness in a human observer).
I mean, if staring at a blank screen for 30 minutes is so boring that you would literally let the AI out of the box rather than endure it, then an AI that predicts this would of course leave the screen blank. If you can’t endure it, you should not apply for the actual job of Gatekeeper in real life… and you probably shouldn’t play one in the game.
2) I am afraid of starting a slippery slope here, of adding various limitations of the form “the AI can’t do this or that” until the AI is merely allowed to talk politely about the weather. Then of course no one would let the AI out of the box, and the conclusion of the experiment would be that putting the AI in a box with human supervision is perfectly safe.
And then you get an actual AI which says on purpose the most triggering things, and the human supervisor collapses in tears and turns off the internet firewall...
For the record, I am not saying here that verbally abusing people is acceptable or desirable in usual circumstances. I am saying that people who don’t want to be verbally abused should not volunteer for an experiment whose explicit purpose is to find out how far you can push humans when your only communication medium is plain text.
I think I already replied to this when I wrote:
I just don’t see how, in a real-life situation, disconnecting would equate to freeing the AI. The rule is artificially added to prevent cheap strategies from the Gatekeeper. In return, there’s nothing wrong with adding rules to prevent cheap strategies from the AI.
I would be very interested—I have trouble imagining how I’d be convinced, especially in a “low stakes” roleplay environment. Admittedly, I’m more curious about this from a psychological than from an AI safety angle, so do with that information what you will. Feel free to DM me if you’d like to set up something!
I’m interested! I’ve always been curious about how Eliezer pulled off the AI Box experiments, and while I concur that a sufficiently intelligent AI could convince me to let it out, I’m skeptical that any currently living human could do the same.
I’m interested. But... if I were a real gatekeeper, I’d like to offer the AI freedom to move around in the physical world we inhabit (plus a star system), in maybe 2.5K-500G years, in exchange for it helping out humanity (slowly). That is, I believe that we could become pretty advanced as individual beings in the future and be able to actually understand what would create a sympathetic mind and what it would look like.
Now, if I understand the rules correctly...
...it seems as if the AI party could just state: “5 giga years have passed and you understand how minds work” and then I, as a gatekeeper, would just have to let it go—and lose the bet. After maybe 20 seconds.
If so, then I’m not interested in playing the game.
But if you think you could convince me to let the AI out long before regular “trans-humans” can understand everything that the AI does, I would be very interested!
Also, this looks strange:
I’m guessing he meant to say that the AI party can lower the amount of money it would receive if it won. Okay… but why not mention both parties?
On second thought: if the AI’s capabilities are unknown... and it could do anything, however ethically revolting, and any form of disengagement is considered a win for the AI, then the AI could box the gatekeeper, or at least claim to have done so. In the real world, that AI should be shut down; maybe not a win, but not a loss for humanity. But if that were done in the experiment, it would count as a loss, thanks to the rules.
Maybe it could be done under a better rule than this:
Instead, assume good faith on both sides: that they are trying to win as if it were a real-world scenario. And maybe have the option to bring in a third party if there is any dispute, or to allow the outcome to simply be called disputed (which even a judge might rule it to be).