...but I don’t see how a victory for the AI party in such an experiment discredits the idea of boxed AI. It simply shows that boxes are not a 100% reliable safeguard. Do boxes foreclose on alternative safeguards that we can show to be more reliable?
The original claim under dispute, at least according to EY’s page, was that boxing an AI of unknown friendliness was, by itself, a viable approach to AI safety. Setting aside all the other ways such an AI might circumvent any “box”, the experiment tests just that one point of failure: whether something could simply talk its way out of the box, and with merely human intelligence standing in for the AI.
Maybe the supposed original claim is a strawman or misrepresentation; I wasn’t involved in the original conversations, so I’m not sure. In any case, the experiment is intended to test/demonstrate that boxing alone is not sufficient, even given a perfect box which can only be opened with the Gatekeeper’s approval. Whether boxing is a useful-but-not-guaranteed safety procedure is a different question.
I understand the claim under dispute, I think.
Insofar as someone chose to have a gatekeeper rather than a lock whose key got tossed into a volcano, it must be possible to “hack” that gatekeeper through a text terminal, if only by meeting their evidentiary standard for friendliness.
The problem is that this can happen in the absence of genuine friendliness.
Do you believe Eliezer’s (or Tuxedage’s) wins were achieved by meeting the Gatekeeper’s standard for Friendliness, or some other method (e.g. psychological warfare, inducing and exploiting emotional states, etc)?
My impression has been that “boxing” is considered non-viable not just because it’s hard to tell if an AI is really truly Friendly, but because it won’t hold even an obvious UFAI that wants out.
Probably the latter, since they both lost at least once. A real AI trying to get out would devote all its energies to counterfeiting friendliness, and would probably succeed.
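As a toy illustration of why counterfeited evidence undermines the “meet their evidentiary standard” route, here is a minimal Bayesian sketch. All the numbers are invented for illustration; the point is only that if an UnFriendly AI can produce essentially the same transcript a Friendly one would, the likelihood ratio is near 1 and the Gatekeeper’s posterior barely moves, however impressive the conversation looks.

```python
# Toy model of a Gatekeeper updating on a "convincing" transcript.
# All probabilities are illustrative assumptions, not estimates.

def posterior_friendly(prior_friendly, p_evidence_if_friendly, p_evidence_if_unfriendly):
    """P(Friendly | evidence) by Bayes' rule."""
    p_evidence = (prior_friendly * p_evidence_if_friendly
                  + (1 - prior_friendly) * p_evidence_if_unfriendly)
    return prior_friendly * p_evidence_if_friendly / p_evidence

prior = 0.5  # the Gatekeeper starts genuinely unsure

# Case 1: only a Friendly AI could plausibly produce this evidence.
print(posterior_friendly(prior, 0.9, 0.01))  # ~0.99: the evidence is informative

# Case 2: an UnFriendly AI can counterfeit the same evidence almost as well.
print(posterior_friendly(prior, 0.9, 0.85))  # ~0.51: the evidence tells you almost nothing
```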
Boxing is non-viable only in the same sense that locks, passwords, treaties, contracts, testing, peer review, seatbelts, and all other imperfect precautions are non-viable.
Pile on enough of them, in combination, and perhaps they will buy a few years, or at least a few seconds, in which to react. All else being equal, is there any reason why an AI of unknown friendliness is any safer without being boxed?
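To make the “pile on enough of them” intuition concrete, here is a small sketch. It assumes, purely for illustration, that each safeguard fails independently with the probabilities shown; against a capable adversary these failures would likely be correlated, so the product should be read as an optimistic bound rather than a prediction.

```python
from math import prod

# Hypothetical probabilities that the AI circumvents each individual safeguard.
# Both the values and the independence assumption are illustrative.
p_circumvent = {
    "air gap": 0.30,
    "gatekeeper approval": 0.60,
    "output rate limit": 0.50,
    "code audit": 0.40,
}

p_all_fail = prod(p_circumvent.values())
print(f"P(every safeguard circumvented) = {p_all_fail:.3f}")  # 0.036 under these assumptions
```

Even under these generous assumptions the combined failure probability never reaches zero; stacking safeguards buys time rather than certainty.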
A flawed containment method is still better at containing than no containment method (if implemented with awareness of its flaws), but apparently a flawed friendly AI will miss a very small target in goalspace and will, for practical purposes, be unfriendly. So, if we spent five minutes considering the alternatives, would we still believe that better boxes are a less tractable problem than friendly AI?
Not that particular AI. But if you think yours is Friendly and others under development have a sufficient probability of being UnFriendly, then, trivially, letting it run (in both senses) beats boxing. Oh, and people will die ‘naturally’ while you dither. I hope that thinking this much about making an AI Friendly will prepare someone to get the job done ASAP once the AGI part seems more feasible.
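The “letting it run beats boxing” step is an expected-value comparison, so here is a minimal sketch of the arithmetic. Every probability and utility below is invented for illustration, and the model deliberately ignores the option value of verifying further while boxed; it only shows how high confidence in your own AI, plus a non-trivial chance of a rival UnFriendly AI launching first, can tip the comparison toward release.

```python
# Toy expected-utility comparison of "release now" vs "keep boxing and verify".
# Every number is an illustrative assumption, not an estimate.

p_mine_friendly = 0.95       # confidence that your own AI is Friendly
p_rival_ufai_first = 0.30    # chance a rival UnFriendly AI launches while you keep verifying

U_GOOD, U_CATASTROPHE, U_STATUS_QUO = 1.0, -1.0, 0.0

# Release now: the outcome depends only on whether your AI really is Friendly.
ev_release = p_mine_friendly * U_GOOD + (1 - p_mine_friendly) * U_CATASTROPHE

# Keep boxing: you forgo the upside for now, and the rival may launch anyway.
ev_box = p_rival_ufai_first * U_CATASTROPHE + (1 - p_rival_ufai_first) * U_STATUS_QUO

print(f"EV(release now) = {ev_release:+.2f}")  # +0.90 under these assumptions
print(f"EV(keep boxing) = {ev_box:+.2f}")      # -0.30 under these assumptions
```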