The AI player could say, in character, that in the early days of AGI research, when people were arguing about the power of a superintelligence, there would have been experiments to see whether humans playing the role of a boxed AI could persuade another human playing a gatekeeper to let it out of the box; that in these experiments the simulated gatekeeper would use a similar algorithm to the one the actual gatekeeper is using to decide whether to let out the actual AI; and so, by deciding to let the AI out, the gatekeeper makes it more likely that the simulated gatekeeper in the experiment lets the AI out, leading to an increase in the measure of worlds where people take the challenge of FAI seriously and successfully build FAI rather than unFriendly AGI.
Though, that does still seem to be an end run around the fourth wall, more like a “super-clever special trick” that Eliezer promises he didn’t use than doing it the hard way.
That version does seem to be allowed-under-the-rules, but not a highly convincing argument. All of the AI Box Experiments took place before TDT-type decision theories were publicly known in enough detail for it to be clear what kind of reasoning that argument is even supposed to appeal to, and if the experiment were repeated now with a TDT-aware gatekeeper, they could probably poke some holes in that argument. (Aside from that, remember that in order for the experiment to be worthwhile in the first place, the gatekeeper has to be someone who thinks that AI boxing actually is a viable strategy for AGI safety, and wants to demonstrate this, so it would be inconsistent (or at least strange) if they could also be moved by an argument suggesting that taking a certain action will increase the measure of worlds where AGI researchers don’t take AI boxing seriously as a safety measure.)
Suppose you had an extremely compelling argument that boxing a transhuman is not a good idea because they could escape (being cleverer than a human pretending to be a transhuman). Then you could combine that argument with a claim about real world consequences.
True, but if he knew of an additional “extremely compelling argument that boxing a transhuman is not a good idea because they could escape”, Eliezer would have just posted it publicly, since that’s what he was trying to convince people of by running the experiments in the first place.
...unless it was a persuasive but fallacious argument, which is allowed under the terms of the experiment, but not allowed under the ethics he follows when speaking as himself. That is an interesting possibility, though probably a bit too clever and tricky to pass “There’s no super-clever special trick to it.”
If you are creative, you can think of many situations where he wouldn’t publicize such an argument (my first response to this idea was the same as yours, although the first explanation I came up with was different). That said, I agree it’s not the most likely possibility given everything we know.