Putting “the burden of proof” aside, I think it would be great if someone stated, more or less formally, what evidence moves them how much toward which model. “Pretraining makes it easier to exploit” is meaningless without numbers: the whole optimistic point is that it’s not overwhelmingly easier (as evidenced by RLHFed systems not always exploiting users), and that the exploits become less catastrophic and more common-sense because of pretraining. So the question is not about the direction of the evidence, but about whether it can outweigh the observation that current systems mostly work.
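To make “what evidence moves them how much” concrete, here is a minimal sketch of the odds form of Bayes’ rule. The function name, the prior, and both likelihood ratios are hypothetical, picked only to show the mechanics; nobody in this thread has proposed these values.

```python
import math

def update_log_odds(prior_prob, likelihood_ratios):
    """Return the posterior probability of a model after applying each
    likelihood ratio P(evidence | model) / P(evidence | alternative)."""
    log_odds = math.log(prior_prob / (1.0 - prior_prob))
    for lr in likelihood_ratios:
        log_odds += math.log(lr)  # independent pieces of evidence add in log-odds
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical numbers, for illustration only.
prior_pessimistic = 0.5
evidence = {
    "pretraining makes exploits easier to find": 2.0,  # favors the pessimistic model
    "current RLHFed systems mostly work": 0.25,        # favors the optimistic model
}

posterior = update_log_odds(prior_pessimistic, evidence.values())
print(f"P(pessimistic model | evidence) = {posterior:.3f}")
```

With these made-up ratios the second observation outweighs the first, which is the shape of the disagreement: both sides can agree on the direction of each piece of evidence and still differ on the magnitudes.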
“Current systems mostly work” not because of RLHF specifically, but because we are under conditions where the iterative design loop works: mainly, if our system is not aligned, it doesn’t kill us, so we can continue iterating until its behaviour is acceptable.
But iterative design works not only because we are not killed: it also wouldn’t work if acceptable behavior didn’t generalize at least somewhat from training. It does generalize, so it’s possible that iteratively aligning a system under safe conditions would produce acceptable behavior even when the system can kill you. Or what is your evidence to the contrary? Like, does AutoGPT immediately kill you if you connect it to some robot via Python?
My evidence is exactly how it is happening.
If you look at actual alignment development and ask yourself “what am I seeing at the empirical level?”, you’ll get this scenario:
We reach a new level of capabilities.
We get a new type of alignment failure.
If this alignment failure doesn’t kill everyone, we can fix it even by very dumb methods, like “RLHF against failure outputs”, but that doesn’t tell us anything about the kill-everyone level of capabilities.
I.e., I don’t expect AutoGPT to kill anyone, because AutoGPT is certainly not capable of doing this. But I expect AutoGPT to have a bunch of failures that were unpredictable in advance.
Examples:
What happened to ChatGPT on release.
What happened to ChatGPT in slightly unusual environment despite all alignment training.
It’s not capable under all conditions, but you can certainly prepare conditions under which AutoGPT can kill you: you can connect it to a robot arm with a knife, explain what commands do what, and tell it to proceed. And AutoGPT will not suddenly start trying to kill you just because it can, right?
“...but it doesn’t tell us anything about kill-everyone level of capabilities.”
Why doesn’t it? Fixing alignment failures under relatively safe conditions may fix them for other conditions too. And why are you thinking about “kill-everyone” capabilities anyway? Do you expect RLHF to work for arbitrary levels of capabilities if you don’t die doing it? Like, if an ASI trained some weaker AI by RLHF in an environment where it could destroy an Earth or two, would it work?
Huh, it’s worse than I expected, thanks. And it even gets worse from GPT-3 to 4. But still, extrapolating from this requires quantification: after all, they did mostly fix it by using a different prompt. How do you decide whether it’s just evidence for “we need more finetuning”?
After thinking for a while, I decided that it’s better to describe the level of capability not as “capable of killing you”, but as “lethal by default output”. I.e.,
If an ASI builds nanotech that self-replicates in a wide range of environments and doesn’t add specific protections against it turning humans into gray goo, you are dead by default;
If an ASI optimizes the economy to get +1000% productivity without specific care about humans, everyone dies;
If an ASI builds a Dyson sphere without specific care about humans, see above;
A more nuanced example: imagine that you have an ASI smart enough to build a high-fidelity simulation of you inside its cognitive process. Even if such an ASI doesn’t pursue any long-term goals, if it is not aligned to, say, respect your mental autonomy, any act of communication is going to turn into literal brainwashing.
The problem with the ability to destroy a planet or two is how hard it is to contain a rogue ASI: if it is capable of destroying a planet, it’s capable of ejecting several von Neumann probes that can strike before we come up with a defense, or of sending radio signals carrying computer viruses, harmful memes, or copies of the ASI. But I think that if you have an unhackable simulation indistinguishable from the real world, and you are somehow unhackable by the ASI yourself, you can eventually align it by simple methods from modern prosaic alignment. The problem is that you can’t say in advance which kind of finetuning you need, because you need generalization in advance to untested domains.