It’s not capable under all conditions, but you can certainly prepare conditions under which AutoGPT can kill you: you can connect it to a robot arm with a knife, explain what commands do what, and tell it to proceed. And AutoGPT will not suddenly start trying to kill you just because it can, right?
If this alignment failure doesn’t kill everyone, we can fix it even by very dumb methods, like “RLHF against failure outputs”, but that tells us nothing about kill-everyone levels of capability.
Why doesn’t it? Fixing alignment failures under relatively safe conditions may fix them for other conditions too. And why are you thinking about “kill-everyone” capabilities anyway: do you expect RLHF to work at arbitrary levels of capability as long as you don’t die doing it? For example, if an ASI trained some weaker AI by RLHF in an environment where that AI could destroy an Earth or two, would it work?
Huh, it’s worse than I expected, thanks. And it even gets worse from GPT-3 to GPT-4. But still, extrapolating from this requires quantification. After all, they did mostly fix it by using a different prompt. How do you decide whether it’s just evidence for “we need more finetuning”?
After thinking for a while, I decided that it’s better to describe the level of capability not as “capable of killing you”, but as “lethal by default output”. I.e.,
If an ASI builds nanotech that self-replicates in a wide range of environments and doesn’t add specific protections against it turning humans into gray goo, you are dead by default;
If an ASI optimizes the economy to get +1000% productivity without specific care for humans, everyone dies;
If an ASI builds a Dyson sphere without specific care for humans, see above.
A more nuanced example: imagine that you have an ASI smart enough to build a high-fidelity simulation of you inside its cognitive process. Even if such an ASI doesn’t pursue any long-term goals, if it is not aligned to, say, respect your mental autonomy, any act of communication is going to turn into literal brainwashing.
The problem with the ability to destroy a planet or two is how hard it is to contain a rogue ASI: if it is capable of destroying a planet, it is capable of launching several von Neumann probes that can strike before we come up with a defense, or of sending radio signals carrying computer viruses, harmful memes, or copies of the ASI. But I think that if you have an unhackable simulation that is indistinguishable from the real world, and you yourself are somehow unhackable by the ASI, you can eventually align it with simple methods from modern prosaic alignment. The problem is that you can’t say in advance which kind of finetuning you need, because you need the generalization to hold in advance, in untested domains.