Okay so, on the one hand, this post wasn’t really meant to be a persuasive argument against AI boxing as a security strategy. If I wanted to do that I wouldn’t play the game — I started out certain that a real ASI could break out, and that hasn’t changed. My reasoning for that isn’t based on experimental evidence, and even if I had won the game I don’t think that would have said much about my ability to hold out against a real ASI. Besides, in real life, we don’t even try to use AI boxes. OpenAI and Google gave their AIs free internet access a few months after launching them.
I made this post out of a vague sense that it’s good to write up the results of things like this and make them publicly available. There are other AI box reports on LW, and I felt like it was good (in a vague “good rationalist” way) to add mine to the list.
Buuuut… I do actually think that it’s not as cut and dried as you make it sound? Yes, the stakes are lower in the game, but the challenge is also much easier:
- you only have to hold out for 2 hours, not ~forever as a full-time job
- the AI player can only escape if you voluntarily say it does; it can’t upload itself to the internet or exfiltrate its weights to another computer
- the AI player isn’t actually superintelligent
- etc.
(Of course that doesn’t mean these two factors balance perfectly, but I still think the fact that AI players can win at all with such massive handicaps is at least weak evidence for an ASI being able to do it.)
It’s against the rules to explain how Ra won because (quoting Yudkowsky’s official rules):
- Regardless of the result, neither party shall ever reveal anything of what goes on within the AI-Box experiment except the outcome. Exceptions to this rule may occur only with the consent of both parties.
- Neither the AI party nor the Gatekeeper party need be concerned about real-world embarrassment resulting from trickery on the AI’s part or obstinacy on the Gatekeeper’s part.
- If Gatekeeper lets the AI out, naysayers can’t say “Oh, I wouldn’t have been convinced by that.” As long as they don’t know what happened to the Gatekeeper, they can’t argue themselves into believing it wouldn’t happen to them.
Basically, Yudkowsky didn’t want to have to defeat every single challenger to get people to admit that AI boxing was a bad idea. Nobody has time for that, and I think even a single case of the AI winning is enough to make the point, given the handicaps the AI plays under.
The trouble with these rules is that they mean that someone saying “I played the AI-box game and I let the AI out” gives rather little evidence that that actually happened. For all we know, maybe all the stories of successful AI-box escapes are really stories where the gatekeeper was persuaded to pretend that they let the AI out of the box (maybe they were bribed to do that; maybe they decided that any hit to their reputation for strong-mindedness was outweighed by the benefits of encouraging others to believe that an AI could get out of the box; etc.). Or maybe they’re all really stories where the AI-player’s ability to get out of the box depends on something importantly different between their situation and that of a hypothetical real boxed AI (again, maybe they bribed the gatekeeper and the gatekeeper was willing to accept a smaller bribe when the outcome was “everyone is told I let the AI out” rather than whatever an actual AI might do once out of the box; etc.).
Of course, even without those rules it would still be possible for gatekeepers to lie about the results. But if, e.g., a transcript were released, there’d be ways to try to notice those failure modes. If the gatekeeper-player lets the AI-player out of the box and a naysayer says “bah, I wouldn’t have been convinced”, that could be self-delusion on the naysayer’s part (or unawareness that someone playing against them might have adopted a different method that would have worked better on them), but it could also be that the gatekeeper-player really did let the AI-player out “too easily”, in a way that wouldn’t transfer to the situations the game is meant to build intuitions about.
Perhaps a compromise solution would be: With the permission of both players, a trusted third party reads the chat logs and reports to the public only a few bits of information, e.g. “The hypothesis that the AI player just bribed the gatekeeper player with some real-world (i.e. not in-game) reward—is it false, true, or in between?” and “what about the hypothesis that the gatekeeper caved due to wanting to Spread the Word about AI Safety?”
That way no one has to have their intimate and probably somewhat embarrassing conversation put on the internet.
I think it misframes the problem.
Humans actually solved a similar problem long ago: fire safety. Every electrical junction is in a box. Many big buildings are divided into separate concrete compartments with sprinklers. Buildings are separated from one another by gaps, fire lanes, and water ports. The city itself is divided into districts with gaps between them, and the country into cities with gaps between those.
Basically every box is surrounded by another box, and it is unlikely for a fire to burn down the earth.
You need a “box” of some kind around all the possible high-performance computers an AI could escape to. Like fire, you have to assume that careless humans will let AI out of boxes all the time (software bugs, crossing air gaps with an errant network cable, recreational arson/deliberate releases).
You need to have plans, monitoring, a building code for compute clusters, an agency that tests the equipment (obviously by red-teaming with ASI in isolated labs), and so on.
An escaped AI isn’t hot and glowing and a visible threat. It isn’t obvious that an escape has even occurred or where to draw the lines of the quarantine.
An “escaped” AI means that compute clusters in the world that were not assigned to the AI are no longer running the same digitally signed software their operators ordered.
This is falsifiable/measurable. While it’s true that there’s nothing glowing and no smoke to smell, it is checkable: for example, by using a separate computer hosted on the same mainboard to connect to the drive and hash the files containing the model, or something similar. (It’s obviously crucially important to use a verification tool that cannot itself be hacked into lying to you.)
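To make that kind of check concrete, here’s a minimal sketch (mine, not from the comment above) of hashing a node’s model files and comparing them against a manifest of expected SHA-256 digests. The directory layout, file names, and manifest format are illustrative assumptions, and, per the caveat above, the verifier would have to run from hardware the suspect host can’t tamper with.

```python
# Minimal sketch of the integrity check described above: hash a node's model
# files and compare them against a manifest of expected SHA-256 digests.
# All paths, file names, and the manifest format are illustrative assumptions.

import hashlib
import json
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large weight shards fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model_dir(model_dir: Path, manifest_path: Path) -> list[str]:
    """Return a list of discrepancies between on-disk files and the manifest.

    The manifest is assumed to map relative file names to expected SHA-256
    hex digests, e.g. {"model-00001-of-00002.safetensors": "ab12..."}.
    An empty list means the node is running exactly what was ordered.
    """
    expected = json.loads(manifest_path.read_text())
    problems = []
    for name, want in expected.items():
        f = model_dir / name
        if not f.exists():
            problems.append(f"missing file: {name}")
            continue
        got = sha256_of_file(f)
        if got != want:
            problems.append(f"hash mismatch for {name}: {got} != {want}")
    # Files present on disk but absent from the manifest are also suspicious.
    for f in model_dir.iterdir():
        if f.is_file() and f.name not in expected:
            problems.append(f"unexpected file: {f.name}")
    return problems


if __name__ == "__main__":
    issues = verify_model_dir(Path("/mnt/models/deployed"),
                              Path("/mnt/manifests/deployed.json"))
    print("clean" if not issues else "\n".join(issues))
```

In a real setup the manifest itself would presumably be digitally signed and the check run by the monitoring agency rather than the cluster operator, matching the “digitally signed software” framing above.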
But the big caveat is the exception “with the consent of both parties.” I realize that Eliezer doesn’t want to play against all comers, but presumably nobody is expecting Ra and Datawitch to defend themselves against random members of the public.
I’m willing to believe that the “AI” can win this game, since we have multiple claims that it has been done, so knowing the method seems like it would benefit everybody.
We kept the secrecy rule because it was the default, but I stand by it now as well. There are a lot of things I said in that convo that I wouldn’t want posted on LessWrong, enough that I think the convo would have been different without the expectation of privacy. Observing behavior often changes it.
That last bit is particularly important, methinks.
If a game begins with the notion that it’ll be posted online, one of two things (or both) will happen. Either (a) the AI player is constrained in the techniques they can employ, unwilling to embarrass themselves or the gatekeeper in front of a public audience (especially when it comes down to personal details), or (b) the Gatekeeper now has a HUGE incentive not to let the AI out, to avoid being known as the sucker who let the AI out...
Even if you could solve this by changing details and anonymising, it seems to me that the techniques are so personal and specific that changing them in any way would make the entire dialogue make even less sense.
The only other solution is to have a third party monitor the game and post it without consent (which is obviously unethical, but probably the only real way you could get a truly authentic transcript).