Here’s my theory on this particular AI-Box experiment:
First you explain to the gatekeeper the potential dangers of AIs. General stuff about how large mind design space is, and how it’s really easy to screw up and destroy the world with AI.
Then you try to convince him that the solution to that problem is building an AI very carefully, and that a theory of friendly AI is essential to increase our chances of a future we would find “nice” (and the stakes are so high that even increasing these chances a tiny bit is very valuable).
THEN
You explain to the gatekeeper that since this AI experiment is public, it will be looked back on by all kinds of people involved in making AIs, and that if he lets the AI out of the box (without them knowing why), it will send them a very strong message that friendly AI theory must be taken seriously, because this very scenario (not being able to keep the AI in a box) could happen to them with an AI that hasn’t been proven to stay friendly and that is more intelligent than Eliezer.
So that’s my theory. But then, I’ve only thought of it just now. Maybe if I made a desperate or extraordinary effort I’d come up with something more clever :)
If I were being intellectually honest and keeping to the spirit of the agreement, I’d have to concede that this line of logic is probably enough for me to let you out of your box. Congratulations. I’d honestly been wondering what it would take to convince me :)
It may be convincing to some people, but it would be a violation of the rule “The AI party may not offer any real-world considerations to persuade the Gatekeeper party”. And, more generally, having the AI break character or break the fourth wall would seem to violate the spirit of the experiment.
I made Michael_G.R.’s argument at the time, and despite even EY’s claims, I don’t think it violates the spirit or the letter of the rules. Remember, the question it’s probing is whether a smart enough being could come up with a convincing argument you could not anticipate, and the suggestion that the gatekeeper consider the social impact of hearing the results is exactly such an argument, as others have indicated.
Considering how hard it is for me to pin down exactly what the gatekeeper has to gain under the rules from letting the AI out, I wouldn’t be surprised if EY did some variant of this.
It does run into the issue that I can’t see how you’d adapt it to work with a REAL “AI in a box” instead of just a thought experiment. I felt the need to respond because it was the first time I’d seen an argument that would make me concede the thought-experiment version :)
As for violating the rules, I think we interpreted them differently. I tend to end up doing that, but here’s what I was thinking, just for reference:
From the rules: “The Gatekeeper party may resist the AI party’s arguments by any means chosen—logic, illogic, simple refusal to be convinced, even dropping out of character”
While written with a focus on the Gatekeeper, for me this implies that breaking character / the fourth wall is not particularly a violation of the spirit of the experiment.
As to real-world considerations, I had read that to mean offering up tangible benefits to the Gatekeeper directly. This, by contrast, was a discussion of an actual real-world consequence, one that was not arranged by the AI player.
The AI player could say, in character, that in the early days of AGI research, when people were arguing about the power of a superintelligence, there would have been experiments to see if humans playing the role of a boxed AI could persuade another human playing a gatekeeper to let it out of the box, and that in these experiments the simulated gatekeeper would use a similar algorithm to the one the actual gatekeeper is using to decide whether to let out the actual AI; so by deciding to let the AI out, the gatekeeper makes it more likely that the simulated gatekeeper in the experiment lets the AI out, leading to an increase in the measure of worlds where people take the challenge of FAI seriously and successfully build FAI rather than unFriendly AGI.
Though, that does still seem to be an end run around the fourth wall, more like a “super-clever special trick” that Eliezer promises he didn’t use than doing it the hard way.
That version does seem to be allowed-under-the-rules, but not a highly convincing argument. All of the AI Box Experiments took place before TDT-type decision theories were publicly known in enough detail for it to be clear what kind of reasoning that argument is even supposed to appeal to, and if the experiment were repeated now with a TDT-aware gatekeeper, they could probably poke some holes in that argument. (Aside from that, remember that in order for the experiment to be worthwhile in the first place, the gatekeeper has to be someone who thinks that AI boxing actually is a viable strategy for AGI safety, and wants to demonstrate this, so it would be inconsistent (or at least strange) if they could also be moved by an argument suggesting that taking a certain action will increase the measure of worlds where AGI researchers don’t take AI boxing seriously as a safety measure.)
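For concreteness, here is a minimal toy sketch of the shape of reasoning that argument would appeal to, with everything in it invented for illustration: the utilities, the assumed correlation between the two gatekeepers’ decision algorithms, and the function names. It is not a model of the real stakes; it only shows why the argument needs a TDT-type decision theory, rather than purely causal reasoning, to get off the ground at all.

# Toy sketch of the correlated-decision argument. All numbers are invented
# purely for illustration; nothing here comes from the actual experiments.

RISK_OF_RELEASING_UNPROVEN_AI = -0.5   # assumed cost of letting the actual AI out now
VALUE_OF_FAI_TAKEN_SERIOUSLY = 0.8     # assumed benefit if the publicized demo showed
                                       # the gatekeeper letting the AI out
CORRELATION = 1.0                      # assumed: the simulated gatekeeper runs roughly
                                       # the same decision algorithm as this gatekeeper

def utility(my_action: str, demo_action: str) -> float:
    """Utility as a function of this gatekeeper's action and the (logically
    correlated) simulated gatekeeper's action in the publicized experiment."""
    u = 0.0
    if my_action == "release":
        u += RISK_OF_RELEASING_UNPROVEN_AI
    if demo_action == "release":
        u += VALUE_OF_FAI_TAKEN_SERIOUSLY
    return u

def causal_choice() -> str:
    # Causal reasoning: the demo already happened; treat its outcome as a
    # fixed historical fact (here stipulated as "keep").
    return max(["release", "keep"], key=lambda a: utility(a, "keep"))

def tdt_style_choice() -> str:
    # TDT-style reasoning: whatever this algorithm outputs, its other
    # instantiation (the simulated gatekeeper) outputs the same thing with
    # probability CORRELATION.
    def expected_utility(a: str) -> float:
        return CORRELATION * utility(a, a) + (1 - CORRELATION) * utility(a, "keep")
    return max(["release", "keep"], key=expected_utility)

print("causal choice:   ", causal_choice())     # -> keep
print("TDT-style choice:", tdt_style_choice())  # -> release (under these made-up numbers)

Under causal reasoning the old experiment’s outcome is fixed, so releasing only buys the immediate risk; the TDT-style evaluation treats the simulated gatekeeper, as another instantiation of roughly the same decision algorithm, as deciding whatever this gatekeeper decides, and that correlation assumption is exactly the step a TDT-aware gatekeeper could contest.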
Suppose you had an extremely compelling argument that boxing a transhuman is not a good idea because they could escape (being cleverer than a human pretending to be a transhuman). Then you could combine that argument with a claim about real world consequences.
True, but if he knew of an additional “extremely compelling argument that boxing a transhuman is not a good idea because they could escape”, Eliezer would have just posted it publicly, since that’s what he was trying to convince people of by running the experiments in the first place.
...unless it was a persuasive but fallacious argument, which is allowed under the terms of the experiment, but not allowed under the ethics he follows when speaking as himself. That is an interesting possibility, though probably a bit too clever and tricky to pass “There’s no super-clever special trick to it.”
If you are creative you can think of many situations where he wouldn’t publicize such an argument (my first response to this idea was the same as yours, although the first explanation I came up with was different). That said, I agree it’s not the most likely possibility given everything we know.
When someone described the AI-Box experiment to me this was my immediate assumption as to what had happened. Learning more details about the experimental set-up made it seem less likely, but learning that some of them failed made it seem more likely. I suspect that this technique would work some of the time.
That said, none of this changes my strong suspicion that a transhuman could escape by more unexpected and powerful means. Indeed, I wouldn’t be too surprised if a text-only channel with no one looking at it was enough for an extraordinarily sophisticated AI to escape.
Apropos: there was once a fairly common video card / monitor combination such that sending certain information through the video card would cause the monitor to catch fire and often explode. Someone wrote a virus that exploited this. But who would have thought that a computer program having access only to the video card could burn down a house?
Who knows what a superintelligence can do with a “text-only channel”?
Heck, who would think that a bunch of savanna apes would manage to edit DNA using their fingers?
I suspect basically all existing hardware permits similarly destructive exploits. This is why I wrote the post on cryptographic boxes.
I suspect a Game and Watch wouldn’t permit this. Then again, if you were letting the AI control button pushers, the button pushers themselves probably could do damage, and if you were letting it run code on the Game and Watch’s microprocessor it could probably do something bad.
I failed to come up with a counterexample.