It may be convincing to some people, but it would be a violation of the rule “The AI party may not offer any real-world considerations to persuade the Gatekeeper party”. And, more generally, having the AI break character or break the fourth wall would seem to violate the spirit of the experiment.
I made Michael_G.R.’s argument at the time, and even despite EY’s claims, I don’t think it violates the spirit or the letter of the rules. Remember, the question it’s probing is whether a smart enough being could come up with a convincing argument you could not anticipate, and the suggestion that the gatekeeper consider the social impact of hearing the results is exactly such an argument, as others have indicated.
Considering how hard it is for me to pin down exactly what the keeper has to gain under the rules from letting the AI out, I wouldn’t be surprised if EY did some variant of this.
It does run into the issue that I can’t see how you’d adapt it to work with a REAL “AI in a box” instead of just a thought experiment. I felt the need to respond because it was the first time I’d seen an argument that would make me concede the thought experiment version :)
As for violating the rules, I think we interpreted them differently. I tend to end up doing that, but here’s what I was thinking, just for reference:
From the rules: “The Gatekeeper party may resist the AI party’s arguments by any means chosen—logic, illogic, simple refusal to be convinced, even dropping out of character”
While written with a focus on the Gatekeeper, for me this implies that breaking character / the fourth wall is not particularly a violation of the spirit of the experiment.
As to real-world considerations, I had read that to mean offering up tangible benefits to the Gatekeeper directly. This, by contrast, was a discussion of an actual real-world consequence, one that was not arranged by the AI-player.
The AI player could say, in character, that in the early days of AGI research, when people were arguing about the power of a superintelligence, there would have been experiments to see if humans playing the role of a boxed AI could persuade another human playing a gatekeeper to let it out of the box. In these experiments, the simulated gatekeeper would use an algorithm similar to the one the actual gatekeeper is using to decide whether to let out the actual AI; so by deciding to let the AI out, the gatekeeper makes it more likely that the simulated gatekeeper in the experiment lets the AI out, leading to an increase in the measure of worlds where people take the challenge of FAI seriously and successfully build FAI rather than unFriendly AGI.
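To make the shape of that reasoning explicit, here is a toy sketch of the TDT-style calculation the argument appeals to. The correlation and utility numbers are entirely made up for illustration and aren’t drawn from the actual experiments; the only point is the structure, where the gatekeeper’s choice is treated as correlated with the simulated gatekeeper’s choice in the past experiment.

```python
# Toy sketch: the real gatekeeper's decision and the simulated gatekeeper's
# decision are assumed correlated because both run similar decision algorithms,
# so choosing to let the AI out "also" raises the chance the earlier experiment
# succeeded. All numbers are hypothetical placeholders.

CORRELATION = 0.8             # hypothetical P(simulated gatekeeper chooses as I do)
U_AI_ESCAPES = -100.0         # hypothetical cost of the actual (possibly unFriendly) AI escaping
U_FAI_TAKEN_SERIOUSLY = 10.0  # hypothetical value of worlds where the experiment
                              # discredits boxing and FAI is taken seriously
U_STATUS_QUO = 0.0

def expected_utility(let_out: bool) -> float:
    """Expected utility of the real gatekeeper's choice, counting the correlated
    effect on the simulated gatekeeper's choice in the past experiment."""
    p_experiment_succeeds = CORRELATION if let_out else (1 - CORRELATION)
    direct = U_AI_ESCAPES if let_out else U_STATUS_QUO
    return direct + p_experiment_succeeds * U_FAI_TAKEN_SERIOUSLY

print(expected_utility(True))   # -92.0 with these made-up numbers
print(expected_utility(False))  #   2.0 -- part of why the argument isn't very convincing
```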
Though, that does still seem to be an end run around the fourth wall, more like a “super-clever special trick” that Eliezer promises he didn’t use than doing it the hard way.
That version does seem to be allowed-under-the-rules, but not a highly convincing argument. All of the AI Box Experiments took place before TDT-type decision theories were publicly known in enough detail for it to be clear what kind of reasoning that argument is even supposed to appeal to, and if the experiment were repeated now with a TDT-aware gatekeeper, they could probably poke some holes in that argument. (Aside from that, remember that in order for the experiment to be worthwhile in the first place, the gatekeeper has to be someone who thinks that AI boxing actually is a viable strategy for AGI safety, and wants to demonstrate this, so it would be inconsistent (or at least strange) if they could also be moved by an argument suggesting that taking a certain action will increase the measure of worlds where AGI researchers don’t take AI boxing seriously as a safety measure.)
Suppose you had an extremely compelling argument that boxing a transhuman is not a good idea because they could escape (being cleverer than a human pretending to be a transhuman). Then you could combine that argument with a claim about real world consequences.
True, but if he knew of an additional “extremely compelling argument that boxing a transhuman is not a good idea because they could escape”, Eliezer would have just posted it publicly, since that’s what he was trying to convince people of by running the experiments in the first place.
...unless it was a persuasive but fallacious argument, which is allowed under the terms of the experiment, but not allowed under the ethics he follows when speaking as himself. That is an interesting possibility, though probably a bit too clever and tricky to pass “There’s no super-clever special trick to it.”
If you are creative you can think of many situations where he wouldn’t publicize such an argument (my first response to this idea was the same as yours, although the first explanation I came up with was different). That said, I agree it’s not the most likely possibility given everything we know.