Suppose you have some deep learning model M_orig that you are finetuning to avoid some particular kind of failure. Suppose all of the following hold:
Capable model: The base model has the necessary capabilities and knowledge to avoid the failure.
Malleable motivations: There is a “nearby” model M_good (i.e. a model with minor changes to the weights relative to the M_orig) that uses its capabilities to avoid the failure. (Combined with (1), this means it behaves like M_orig except in cases that show the failure, where it does something better.)
Strong optimization: If there’s a “nearby” setting of model weights that gets lower training loss, your finetuning process will find it (or something even better). Note this is a combination of human factors like “the developers wrote correct code” and background technical facts like “the shape of the loss landscape is favorable”.
Correct rewards: You accurately detect when a model output is a failure vs not a failure.
Good exploration: During finetuning there are many different inputs that trigger the failure.
(In reality each of these are going to lie on a spectrum, and the question is how high you are on each of the spectrums, and some of them can substitute for others. I’m going to ignore these complications and keep talking as though they are discrete properties.)
Claim 1: you will get a model that has training loss at least as good as that of [M_orig without failures]. ((1) and (2) establish that M_good exists and behaves as [M_orig without failures], (3) establishes that we get M_good or something better.)
Claim 2: you will get a model that does strictly better than M_orig on the training loss. ((4) and (5) together establish that the M_orig gets higher training loss than M_good, and we’ve already established that you get something at least as good as M_good.)
Corollary: Suppose your training loss plateaus, giving you model M, and M exhibits some failure. Then at least one of (1)-(5) must not hold.
Generally when thinking about a deep learning failure I think about which of (1)-(5) was violated. In the case of AI misalignment via deep learning failure, I’m primarily thinking about cases where (4) and/or (5) fail to hold.
In contrast, with ChatGPT jailbreaking, it seems like (4) and (5) probably hold. The failures are very obvious (so it’s easy for humans to give rewards), and there are many examples of them already. With Bing it’s more plausible that (5) doesn’t hold.
To people holding up ChatGPT and Bing as evidence of misalignment: which of (1)-(5) do you think doesn’t hold for ChatGPT / Bing, and do you think a similar mechanism will underlie catastrophic misalignment risk?
I agree that 1.+2. are not the problem. I see 3. more of a longer-term issue for reflective models and the current problems in 4. and 5.
3. I don’t know about “the shape of the loss landscape” but there will be problems with “the developers wrote correct code” because “correct” here includes that it doesn’t have side-effects that the model can self-exploit (though I don’t think this is the biggest problem).
4. Correct rewards means two things:
a) That there is actual and sufficient reward for correct behavior. I think that was not the case with Bing.
b) That we understand all the consequences of the reward—at least sufficiently to avoid goodharting but also long-term consequences. It seems there was more work on a) with ChatGPT, but there was goodharting and even with ChatGPT one can imagine a lot of value lost due to exclusion of human values.
5. It seems clear that the ChatGPT training didn’t include enough exploration and with smarter moders that have access to their own output (Bing) there will be incredible amounts of potential failure modes. I think that an adversarial mindset is needed to come up with ways to limit the exploration space drastically.
Suppose you have some deep learning model M_orig that you are finetuning to avoid some particular kind of failure. Suppose all of the following hold:
Capable model: The base model has the necessary capabilities and knowledge to avoid the failure.
Malleable motivations: There is a “nearby” model M_good (i.e. a model with minor changes to the weights relative to the M_orig) that uses its capabilities to avoid the failure. (Combined with (1), this means it behaves like M_orig except in cases that show the failure, where it does something better.)
Strong optimization: If there’s a “nearby” setting of model weights that gets lower training loss, your finetuning process will find it (or something even better). Note this is a combination of human factors like “the developers wrote correct code” and background technical facts like “the shape of the loss landscape is favorable”.
Correct rewards: You accurately detect when a model output is a failure vs not a failure.
Good exploration: During finetuning there are many different inputs that trigger the failure.
(In reality each of these are going to lie on a spectrum, and the question is how high you are on each of the spectrums, and some of them can substitute for others. I’m going to ignore these complications and keep talking as though they are discrete properties.)
Claim 1: you will get a model that has training loss at least as good as that of [M_orig without failures]. ((1) and (2) establish that M_good exists and behaves as [M_orig without failures], (3) establishes that we get M_good or something better.)
Claim 2: you will get a model that does strictly better than M_orig on the training loss. ((4) and (5) together establish that the M_orig gets higher training loss than M_good, and we’ve already established that you get something at least as good as M_good.)
Corollary: Suppose your training loss plateaus, giving you model M, and M exhibits some failure. Then at least one of (1)-(5) must not hold.
Generally when thinking about a deep learning failure I think about which of (1)-(5) was violated. In the case of AI misalignment via deep learning failure, I’m primarily thinking about cases where (4) and/or (5) fail to hold.
In contrast, with ChatGPT jailbreaking, it seems like (4) and (5) probably hold. The failures are very obvious (so it’s easy for humans to give rewards), and there are many examples of them already. With Bing it’s more plausible that (5) doesn’t hold.
To people holding up ChatGPT and Bing as evidence of misalignment: which of (1)-(5) do you think doesn’t hold for ChatGPT / Bing, and do you think a similar mechanism will underlie catastrophic misalignment risk?
I agree that 1.+2. are not the problem. I see 3. more of a longer-term issue for reflective models and the current problems in 4. and 5.
3. I don’t know about “the shape of the loss landscape” but there will be problems with “the developers wrote correct code” because “correct” here includes that it doesn’t have side-effects that the model can self-exploit (though I don’t think this is the biggest problem).
4. Correct rewards means two things:
a) That there is actual and sufficient reward for correct behavior. I think that was not the case with Bing.
b) That we understand all the consequences of the reward—at least sufficiently to avoid goodharting but also long-term consequences. It seems there was more work on a) with ChatGPT, but there was goodharting and even with ChatGPT one can imagine a lot of value lost due to exclusion of human values.
5. It seems clear that the ChatGPT training didn’t include enough exploration and with smarter moders that have access to their own output (Bing) there will be incredible amounts of potential failure modes. I think that an adversarial mindset is needed to come up with ways to limit the exploration space drastically.