Apparently, the idea is that this sort of game tells us something useful about AI safety.
But I don’t get it.
You obviously knew that you were not unleashing a probably-malign superintelligence on the world by letting Ra out. So how does your letting Ra out in this game say anything about how you would behave if you did think that (at least initially)?
So I don’t get it.
And if this does say something useful about AI safety, why is it against the rules to tell us how Ra won?
I think the debate about AI defense strategies has moved well past the idea of “well, just keep it locked up where it can’t reach its own power switch or make any choices that impact itself”.
I agree that this was never a particularly compelling demonstration, especially without revealing the “one weird trick” that the AI players use to get out. But I was also never of the opinion that it’d work anyway.
It’s still mildly interesting when humans fail to predict even other humans’ ability to socially-engineer their behavior. I don’t think it says much about AI safety, but it does say something about human vulnerabilities.
I disagree. I think that there is an important point to be made here about AI safety. A lot of people have made the argument that ‘an agent which can only work by communicating via text on a computer screen, sent to a human who has been pre-warned to not let it out’ can never result in a reasonable thoughtful intelligent human choosing to let it out.
I think that the fact that this experiment has been run a few times, and sometimes results in the guardian losing, presents at least some evidence that this claim of persuasion-immunity is false.
I think this claim still matters for AI safety, even in the current world.
Yes, the frontier labs have been letting their models access the internet and communicate with the wider world. But they have not been doing so without safety screening. This safety screening involves, at least in part, a human reading text produced by the model and deciding ‘it is ok to approve this model for contact with the world’.
Currently, I think the frontier labs contain a lot of employees who would describe themselves as not vulnerable to being persuaded to incorrectly approve a model for release. I think this is overconfidence on their part. Humans are vulnerable to being tricked and conned. I’m not saying that nobody can ever avoid being persuaded, just that we can’t assume the robustness of this safety measure and should devise better safety measures.
ok sure, but this game is still terrible and is evidence against clear thinking about the problem at hand on the part of anyone who plays it as a test of anything but whether they’ll win a social deception game. perhaps revealing the transcript would fix it; I sort of doubt I’d be convinced then either. I just don’t think there’s a good way to set this up so there are any significant constraints making the situation similar.
Apparently, the idea is that this sort of game tells us something useful about AI safety.
The game was designed under the assumption that when humans create a superhuman intelligence capable of destroying humanity if it so chooses, they will hesitate to unleash it on the world. Metaphorically speaking, they will keep the prototype locked in a box (i.e. isolated from the internet, other humans, robotic bodies, factories that could produce robotic bodies, etc.), until they are somehow convinced that the intelligence is not going to kill us (presumably by figuring out a way to prove that mathematically from its source code).
This assumption seems a little silly in retrospect. Of course when a company creates a potentially omnicidal artificial intelligence, the first thing they will do is connect it to the internet, and the second thing they will do is integrate it with all kinds of stuff that already exists (e-mails, calendars, household devices, self-driving cars, drones, nuclear weapons). How else are they going to pay for the costs of research?
Okay so, on the one hand, this post wasn’t really meant to be a persuasive argument against AI boxing as a security strategy. If I wanted to do that I wouldn’t play the game — I started out certain that a real ASI could break out, and that hasn’t changed. My reasoning for that isn’t based on experimental evidence, and even if I had won the game I don’t think that would have said much about my ability to hold out against a real ASI. Besides, in real life, we don’t even try to use AI boxes. OpenAI and Google gave their AIs free internet access a few months after launching them.
I made this post out of a vague sense that it’s good to write up the results of things like this and make them publicly available. There are other AI box reports on LW, and I felt like it was good (in a vague “good rationalist” way) to add mine to the list.
Buuuut… I do actually think that it’s not as cut and dried as you make it sound? Yes, the stakes are lower in the game, but the challenge is also much easier:
- you only have to hold out for 2 hours, not ~forever, doing this as a full-time job
- the AI player can only escape if you voluntarily say it does; it can’t upload itself to the internet or exfiltrate its weights to another computer
- the AI player isn’t actually superintelligent
- etc.
(Of course that doesn’t mean these two factors balance perfectly, but I still think the fact that AI players can win at all with such massive handicaps is at least weak evidence for an ASI being able to do it.)
It’s against the rules to explain how Ra won because (quoting Yudkowsky’s official rules):
Regardless of the result, neither party shall ever reveal anything of what goes on within the AI-Box experiment except the outcome. Exceptions to this rule may occur only with the consent of both parties.
- Neither the AI party nor the Gatekeeper party need be concerned about real-world embarrassment resulting from trickery on the AI’s part or obstinacy on the Gatekeeper’s part.
- If Gatekeeper lets the AI out, naysayers can’t say “Oh, I wouldn’t have been convinced by that.” As long as they don’t know what happened to the Gatekeeper, they can’t argue themselves into believing it wouldn’t happen to them.
Basically, Yudkowsky didn’t want to have to defeat every single challenger to get people to admit that AI boxing was a bad idea. Nobody has time for that, and I think even a single case of the AI winning is enough to make the point, given the handicaps the AI plays under.
The trouble with these rules is that they mean that someone saying “I played the AI-box game and I let the AI out” gives rather little evidence that that actually happened. For all we know, maybe all the stories of successful AI-box escapes are really stories where the gatekeeper was persuaded to pretend that they let the AI out of the box (maybe they were bribed to do that; maybe they decided that any hit to their reputation for strong-mindedness was outweighed by the benefits of encouraging others to believe that an AI could get out of the box; etc.). Or maybe they’re all really stories where the AI-player’s ability to get out of the box depends on something importantly different between their situation and that of a hypothetical real boxed AI (again, maybe they bribed the gatekeeper and the gatekeeper was willing to accept a smaller bribe when the outcome was “everyone is told I let the AI out” rather than whatever an actual AI might do once out of the box; etc.).
Of course, even without those rules it would still be possible for gatekeepers to lie about the results. But if e.g. a transcript were released then there’d be ways to try to notice those failure modes. If the gatekeeper-player lets the AI-player out of the box and a naysayer says “bah, I wouldn’t have been convinced”, that could be self-delusion on the naysayer’s part (or unawareness that someone playing against them might have adopted a different method that would have worked better on them) but it could also be that the gatekeeper-player really did let the AI-player out “too easily” in a way that wouldn’t transfer to the situations the game is meant to build intuitions about.
Perhaps a compromise solution would be: With the permission of both players, a trusted third party reads the chat logs and reports to the public only a few bits of information, e.g. “The hypothesis that the AI player just bribed the gatekeeper player with some real-world (i.e. not in-game) reward—is it false, true, or in between?” and “what about the hypothesis that the gatekeeper caved due to wanting to Spread the Word about AI Safety?”
That way no one has to have their intimate and probably somewhat embarrassing conversation put on the internet.
I think it misframes the problem. Humans actually solved a similar problem long ago: fire safety. Every electrical junction is in a box. Many big buildings are made into separate big concrete boxes with sprinklers. The buildings are separated from one another by gaps, with fire lanes and water ports. The city itself is separated into districts that have gaps. The country is separated into cities with gaps between them.
Basically every box is surrounded by another box, and it is unlikely for a fire to burn down the earth.
You need a “box” of some kind around all the possible high-performance computers an AI could escape to. Like fire, you have to assume that careless humans will let AI out of boxes all the time (software bugs, crossing air gaps with an errant network cable, recreational arson/deliberate releases).
You need to have plans, monitoring, a compute cluster building code, an agency that tests the equipment (obviously by red teaming with ASI in isolated labs) and so on.
An escaped AI isn’t hot and glowing and a visible threat. It isn’t obvious that an escape has even occurred or where to draw the lines of the quarantine.
An “escaped” AI means compute clusters in the world that were not assigned to the AI are no longer running a model with the same digitally signed software their operators ordered.
This is falsifiable/measurable. While it is true that there’s nothing glowing and no smoke to smell, it is checkable. (Checkable by things like using a separate computer hosted on the same mainboard to connect to the drive and hash the files containing the model, or similar. It’s obviously crucially important to use a tool that cannot be hacked to lie to you.)
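To make “checkable” concrete, here’s a minimal sketch of that kind of integrity check in Python. The file paths and manifest format are made up for illustration, and as noted above, this only helps if the verifier itself runs on hardware and software the AI cannot tamper with.

```python
import hashlib
import json
from pathlib import Path

# Hypothetical locations: an operator-approved manifest of expected
# weight-file hashes, and the directory holding the model actually
# deployed on this cluster. Both names are illustrative.
MANIFEST_PATH = Path("/secure/expected_model_manifest.json")
MODEL_DIR = Path("/mnt/cluster_storage/deployed_model")


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large weight shards need not fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def verify_deployment() -> bool:
    """Compare every deployed file against the approved manifest."""
    # Assumed manifest format: {"relative/path": "hex sha256 digest", ...}
    expected = json.loads(MANIFEST_PATH.read_text())
    ok = True
    for rel_path, expected_hash in expected.items():
        if sha256_of(MODEL_DIR / rel_path) != expected_hash:
            print(f"MISMATCH: {rel_path}")
            ok = False
    # Files on disk that the manifest doesn't mention are also suspicious.
    for path in MODEL_DIR.rglob("*"):
        if path.is_file() and str(path.relative_to(MODEL_DIR)) not in expected:
            print(f"UNEXPECTED FILE: {path.relative_to(MODEL_DIR)}")
            ok = False
    return ok


if __name__ == "__main__":
    print("clean" if verify_deployment() else "tampering suspected")
```

The hard part, of course, is not the hashing but guaranteeing that the machine doing the hashing reports honestly, which is why the comment above insists on a tool that cannot be hacked to lie to you.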
But the big caveat is the exception “with the consent of both parties.” I realize that Eliezer doesn’t want to play against all comers, but presumably, nobody is expecting Ra and Datawitch to defend themselves against random members of the public.
I’m willing to believe that the “AI” can win this game, since we have multiple claims that it has happened, so knowing the method seems like it would benefit everybody.
We kept the secrecy rule because it was the default but I stand by it now as well. There are a lot of things I said in that convo that I wouldn’t want posted on lesswrong, enough that I think the convo would have been different without the expectation of privacy. Observing behavior often changes it.
That last bit is particularly important, methinks. If a game begins with the notion that it’ll be posted online, one of two things (or both) will happen. Either (a) the AI is constrained in the techniques they can employ, unwilling to embarrass themselves or the gatekeeper in front of a public audience (especially when it comes down to personal details), or (b) the Gatekeeper now has a HUGE incentive not to let the AI out, to avoid being known as the sucker who let the AI out...
Even if you could solve this by changing details and anonymising, it seems to me that the techniques are so personal and specific that changing them in any way would make the entire dialogue make even less sense.
The only other solution is to have a third-party monitor the game and post it without consent (which is obviously unethical, but probably the only real way you could get a truly authentic transcript.)
Two things don’t have to be completely identical to each other for one to give us useful information about the other. Even though the game is not completely identical to the risky scenario (as you pointed out: you don’t play against a malign superintelligence), it serves as useful evidence to those who believe that they can’t possibly lose the game against a regular human.