The only way I can see this happening with non-negligible probability is if we create AGI along more human lines—e.g., uploaded brains that evolve through a harsh selection process not aligned with human values. In that scenario, it may be near certain. Nothing is closer to a mind design capable of torturing humans than another human mind—we do that all the time today.
As others point out, though, a sign being flipped in an explicit utility function is a failure mode that people understand and are already looking for. More than that, it would only produce minimal human utility if the AI had a correct description of human utility; otherwise, it would just use us for fuel and building material. The optimization machinery also has to work well enough. Everything about the AGI, loosely speaking, has to be near-perfect except for that one bit. This naively suggests a probability near zero. Setting aside the previous paragraph, I can’t imagine a counter-scenario clearly enough to make me revise this estimate.
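The “one flipped bit, everything else near-perfect” failure mode can be shown with a toy optimizer (hypothetical code, not from any real system; `utility`, `SIGN`, and `choose_action` are made up for illustration): flipping a single sign turns the identical search loop from picking the best-scoring action into picking the worst.

```python
# Toy illustration of a sign-flip bug in an explicit utility function.
# All names are hypothetical; this is a sketch, not a model of any real AGI.

def utility(action: int) -> float:
    """Stand-in utility: prefers actions close to 10."""
    return -abs(action - 10)

SIGN = 1  # a single flipped sign here...

def choose_action(actions):
    # ...turns this maximizer into a minimizer of the same utility.
    return max(actions, key=lambda a: SIGN * utility(a))

actions = range(21)
best = choose_action(actions)  # with SIGN = 1, picks 10 (the best action)
# With SIGN = -1 the identical loop picks 0 or 20 (the worst actions),
# while every other part of the system still works "perfectly".
```

The point of the sketch is that the optimizer, the utility description, and the action space all have to be working correctly for the flipped sign to yield a coherently anti-optimized outcome.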
Everything about the AGI, loosely speaking, has to be near-perfect except for that one bit.
Isn’t this exactly what happened with the GPT-2 bug, which led to maximally ‘bad’ output? Would that not suggest that the probability of this occurring with an AGI is non-negligible?
No. First, people thinking about creating an AGI from scratch (i.e., one comparable to the sort of AI you’re imagining) have already warned against this exact issue and discussed measures to prevent a simple change of one bit from having any effect. (It’s the problem you don’t spot that’ll kill you.)
Second, GPT-2 is not near-perfect. It does pretty well at a job it was never intended to do, but if we ignore that context it is quite flawed. Accordingly, the bug’s output was nowhere near maximally bad in terms of human disutility. The program did indeed have a silly flaw, but I assume that’s because it was more of a quick experiment than a model for AGI. Indeed, if I try to imagine making GPT-N dangerous, I come up with the idea of an artificial programmer that uses vaguely similar principles to auto-complete programs and could thus self-improve. Reversing the sign of its reward function would then make it produce garbage code or non-code, rendering it mostly harmless.
Again, it’s the subtle flaw you don’t spot in GPT-N that could produce an AI capable of killing you.