The trouble with these rules is that they mean that someone saying “I played the AI-box game and I let the AI out” gives rather little evidence that that actually happened. For all we know, maybe all the stories of successful AI-box escapes are really stories where the gatekeeper was persuaded to pretend that they let the AI out of the box (maybe they were bribed to do that; maybe they decided that any hit to their reputation for strong-mindedness was outweighed by the benefits of encouraging others to believe that an AI could get out of the box; etc.). Or maybe they’re all really stories where the AI-player’s ability to get out of the box depends on something importantly different between their situation and that of a hypothetical real boxed AI (again, maybe they bribed the gatekeeper and the gatekeeper was willing to accept a smaller bribe when the outcome was “everyone is told I let the AI out” rather than whatever an actual AI might do once out of the box; etc.).
Of course, even without those rules it would still be possible for gatekeepers to lie about the results. But if e.g. a transcript were released then there’d be ways to try to notice those failure modes. If the gatekeeper-player lets the AI-player out of the box and a naysayer says “bah, I wouldn’t have been convinced”, that could be self-delusion on the naysayer’s part (or unawareness that someone playing against them might have adopted a different method that would have worked better on them) but it could also be that the gatekeeper-player really did let the AI-player out “too easily” in a way that wouldn’t transfer to the situations the game is meant to build intuitions about.
Perhaps a compromise solution would be: With the permission of both players, a trusted third party reads the chat logs and reports to the public only a few bits of information, e.g. “The hypothesis that the AI player just bribed the gatekeeper player with some real-world (i.e. not in-game) reward—is it false, true, or in between?” and “what about the hypothesis that the gatekeeper caved due to wanting to Spread the Word about AI Safety?”
That way no one has to have their intimate and probably somewhat embarrassing conversation put on the internet.
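For what it's worth, the report the third party publishes could be as small as a couple of three-way verdicts. Here is a minimal sketch in Python, where the two hypotheses and the verdict scale are just the examples from the paragraph above, not an actual protocol anyone has adopted:

```python
# Sketch of the "few bits of information" a trusted third party might publish.
# The class names and the two hypotheses are illustrative assumptions only.
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    FALSE = "false"
    TRUE = "true"
    IN_BETWEEN = "in between"


@dataclass(frozen=True)
class AdjudicationReport:
    # Did the AI player bribe the gatekeeper with a real-world (not in-game) reward?
    real_world_bribe: Verdict
    # Did the gatekeeper cave because they wanted to Spread the Word about AI Safety?
    caved_to_spread_the_word: Verdict


# The transcript stays private; only something like this gets published.
example = AdjudicationReport(
    real_world_bribe=Verdict.FALSE,
    caved_to_spread_the_word=Verdict.IN_BETWEEN,
)
print(example)
```

Anything much beyond a handful of such fields would start to leak the conversation itself, which is the thing both players wanted kept off the internet.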
I think the AI-box game misframes the problem.
Humans actually solved a similar problem long ago: fire safety. Every electrical junction goes in a box. Many big buildings are divided into separate concrete compartments with sprinklers. Buildings are separated from one another by gaps, fire lanes, and hydrants. The city itself is divided into districts with gaps between them, and the country into cities with gaps between those.
Basically, every box is surrounded by another box, so a single fire is unlikely to burn down the whole earth.
You need a “box” of some kind around every high-performance computer an AI could escape to. Like fire, you have to assume that careless humans will let AIs out of boxes all the time (software bugs, air gaps crossed by an errant network cable, recreational arson/deliberate releases).
You need plans, monitoring, a building code for compute clusters, an agency that tests the equipment (obviously by red-teaming with ASI in isolated labs), and so on.
An escaped AI isn’t hot, glowing, and visibly a threat. It isn’t obvious that an escape has even occurred, or where to draw the quarantine lines.
An “escaped” AI means that compute clusters which were never assigned to the AI are no longer running the digitally signed model and software their operators ordered.
This is falsifiable/measurable. While there is nothing glowing and no smoke to smell, it is checkable, e.g. by using a separate computer hosted on the same mainboard to connect to the drive and hash the files containing the model, or something similar. It is obviously crucially important that the checking tool itself cannot be hacked into lying to you.
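For concreteness, here is a minimal sketch of what such a check might look like, assuming an operator-signed JSON manifest of SHA-256 file digests and an Ed25519 operator key. The file layout, manifest format, and function names are all illustrative assumptions rather than anything from the comment above; the signature check uses the `cryptography` package.

```python
# Minimal sketch of the integrity check described above (illustrative only).
# Assumptions: the operator publishes a JSON manifest mapping file names to
# SHA-256 digests, signed offline with an Ed25519 key; the check runs on a
# separate, trusted machine with read-only access to the cluster's drive.
import hashlib
import json
from pathlib import Path

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey


def hash_file(path: Path) -> str:
    """SHA-256 of a file, streamed so multi-gigabyte model shards don't fill RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def cluster_is_clean(model_dir: Path, manifest_path: Path,
                     signature_path: Path, operator_key: Ed25519PublicKey) -> bool:
    """True iff the files on disk exactly match the operator-signed manifest."""
    manifest_bytes = manifest_path.read_bytes()
    try:
        # First check that the manifest itself hasn't been swapped out.
        operator_key.verify(signature_path.read_bytes(), manifest_bytes)
    except InvalidSignature:
        return False

    expected = json.loads(manifest_bytes)
    for name, digest in expected.items():
        path = model_dir / name
        if not path.is_file() or hash_file(path) != digest:
            return False

    # Also flag extra files the operator never ordered.
    on_disk = {p.relative_to(model_dir).as_posix()
               for p in model_dir.rglob("*") if p.is_file()}
    return on_disk == set(expected)
```

The signature check on the manifest comes first because an attacker who can rewrite the model files could presumably also rewrite an unsigned manifest; and, as noted above, the whole thing is only as trustworthy as the separate machine the check runs on.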