But iterative design works not only because we are not killed—it also wouldn’t work if acceptable behavior didn’t generalize at least somewhat from training. But it does generalize, so it’s possible that iteratively aligning a system under safe conditions would produce acceptable behavior even when as system can kill you. Or what is your evidence to the contrary? Like, does AutoGPT immediately kills you, if you connect it to some robot via python?
If you look at actual alignment development and ask yourself “what am I see, at empirical level?”, you’ll get this scenario:
We reach new level of capabilities
We get new type of alignment failures
If this alignment failure doesn’t kill everyone, we can fix it even by very dumb methods, like “RLHF against failure outputs”, but it doesn’t tell us anything about kill-everyone level of capabilities.
I.e., I don’t expect AutoGPT to kill anyone, because AutoGPT is certainly not capable to do this. But I expect that AutoGPT got a bunch of failures unpredictable in advance.
It’s not capable under all conditions, but you can certainly prepare conditions under which AutoGPT can kill you: you can connect it to a robot arm with a knife, explain what commands do what, and tell it to proceed. And AutoGPT will not suddenly start trying to kill you just because it can, right?
If this alignment failure doesn’t kill everyone, we can fix it even by very dumb methods, like “RLHF against failure outputs”, but it doesn’t tell us anything about kill-everyone level of capabilities.
Why doesn’t it? Fixing alignment failures under relatively safe conditions may fix them for other conditions too. Or why are you thinking about “kill-everyone” capabilities anyway—do you expect RLHF to work for arbitrary levels of capabilities if you don’t die doing it? Like if an ASI trained some weaker AI by RLHF in an environment where it can destroy Earth or two, it would work?
Huh, it’s worse than I expected, thanks. And it even gets worse from GPT-3 to 4. But still—extrapolation from this requires quantification—after all they did mostly fix it by using different promt. How do you decide whether it’s just an evidence for “we need more finetuning”?
After thinking for a while, I decided that it’s better to describe level of capability not as”capable to kill you”, but “lethal by default output”. I.e.,
If ASI builds self-replicating in wide range of environments nanotech and doesn’t put specific protections from it turning humans into gray goo, you are dead by default;
If ASI optimizes economy to get +1000% productivity, without specific care about humans everyone dies;
If ASI builds Dyson sphere without specifc care about humans, see above;
More nuanced example: imagine that you have ASI smart enough to build high fidelity simulation of you inside of its cognitive process. Even if such ASI doesn’t pursue any long-term goals, if it is not aligned to, say, respect your mental autonomy, any act of communication is going to turn into literal brainwashing.
The problem with possibility to destroy planet or two is how hard to contain rogue ASI: if it is capable to destroy planet, it’s capable to eject several von Neumann probes which can strike before we can come up with defense, or send radiosignals with computer viruses or harmful memes or copies of ASI. But I think that if you have unhackable indistinguishable from real world simulation and you are somehow unhackable by ASI, you can eventually align it by simple methods from modern prosaic alignment. The problem is that you can’t say in advance which kind of finetuning you need, because you need generalization in advance in untested domains.
But iterative design works not only because we are not killed—it also wouldn’t work if acceptable behavior didn’t generalize at least somewhat from training. But it does generalize, so it’s possible that iteratively aligning a system under safe conditions would produce acceptable behavior even when as system can kill you. Or what is your evidence to the contrary? Like, does AutoGPT immediately kills you, if you connect it to some robot via python?
My evidence is how it is exactly happening.
If you look at actual alignment development and ask yourself “what am I see, at empirical level?”, you’ll get this scenario:
We reach new level of capabilities
We get new type of alignment failures
If this alignment failure doesn’t kill everyone, we can fix it even by very dumb methods, like “RLHF against failure outputs”, but it doesn’t tell us anything about kill-everyone level of capabilities.
I.e., I don’t expect AutoGPT to kill anyone, because AutoGPT is certainly not capable to do this. But I expect that AutoGPT got a bunch of failures unpredictable in advance.
Examples:
What happened to ChatGPT on release.
What happened to ChatGPT in slightly unusual environment despite all alignment training.
It’s not capable under all conditions, but you can certainly prepare conditions under which AutoGPT can kill you: you can connect it to a robot arm with a knife, explain what commands do what, and tell it to proceed. And AutoGPT will not suddenly start trying to kill you just because it can, right?
Why doesn’t it? Fixing alignment failures under relatively safe conditions may fix them for other conditions too. Or why are you thinking about “kill-everyone” capabilities anyway—do you expect RLHF to work for arbitrary levels of capabilities if you don’t die doing it? Like if an ASI trained some weaker AI by RLHF in an environment where it can destroy Earth or two, it would work?
Huh, it’s worse than I expected, thanks. And it even gets worse from GPT-3 to 4. But still—extrapolation from this requires quantification—after all they did mostly fix it by using different promt. How do you decide whether it’s just an evidence for “we need more finetuning”?
After thinking for a while, I decided that it’s better to describe level of capability not as”capable to kill you”, but “lethal by default output”. I.e.,
If ASI builds self-replicating in wide range of environments nanotech and doesn’t put specific protections from it turning humans into gray goo, you are dead by default;
If ASI optimizes economy to get +1000% productivity, without specific care about humans everyone dies;
If ASI builds Dyson sphere without specifc care about humans, see above;
More nuanced example: imagine that you have ASI smart enough to build high fidelity simulation of you inside of its cognitive process. Even if such ASI doesn’t pursue any long-term goals, if it is not aligned to, say, respect your mental autonomy, any act of communication is going to turn into literal brainwashing.
The problem with possibility to destroy planet or two is how hard to contain rogue ASI: if it is capable to destroy planet, it’s capable to eject several von Neumann probes which can strike before we can come up with defense, or send radiosignals with computer viruses or harmful memes or copies of ASI. But I think that if you have unhackable indistinguishable from real world simulation and you are somehow unhackable by ASI, you can eventually align it by simple methods from modern prosaic alignment. The problem is that you can’t say in advance which kind of finetuning you need, because you need generalization in advance in untested domains.