Great post! Lays out a clear agenda and I agree in most part, including about timeliness of the scientific case due to the second argument. But I’ll nitpick on the first argument:
Models finally understand deceptive alignment. Claude and GPT4 give (for the most part) clear explanations of the deceptive alignment, why it’s a risk, why it’s a useful strategy, etc. Previous generations of models mostly did not (e.g., the GPT-3 or 3.5 generation).
Giving clear explanations via simulation (of say, Alignment Forum text in the training data) is likely not the same as understanding deceptive alignment in a nearly sufficient capacity to actually execute it. It is possible deceptive alignment descriptions did not occur in the training data of GPT-3/GPT-3.5.
It is possible that being able to describe a capability is sufficient to execute it given a certain level of planning capabilities, situational awareness etc. but that does not seem obvious. I guess this is acknowledged implicitly to some extent later in the post.
Generating clear explanations via simulation is definitely not the same as being able to execute it, I agree. I think it’s only a weak indicator / weakly suggestive evidence that now is a good time to start looking for these phenomena. I think being able to generate explanations of deceptive alignment is most likely a pre-requisite to deceptive alignment, since there’s emerging evidence that models can transfer from descriptions of behaviors to actually executing on those behaviors (e.g., upcoming work from Owain Evans and collaborators, and this paper on out of context meta learning). In general, we want to start looking for evidence of deceptive alignment before it’s actually a problem, and “whether or not the model can explain deceptive alignment” seems like a half-reasonable bright line we could use to estimate when it’s time to start looking for it, in lieu of other evidence (though deceptive alignment could also certainly happen before then too).
(Separately, I would be pretty surprised if deceptive alignment descriptions didn’t occur in the GPT3.5 training corpus, e.g., since arXiv is often included as a dataset in pretraining papers, and e.g., the original deceptive alignment paper was on arXiv.)
My point was more to say that planning capabilities and situational awareness can be a bottleneck to executing ‘deceptive alignment’ even with a good understanding of what it means. Descriptions → Execution seems plausible for some behaviours, but producing descriptions probably becomes weaker evidence the harder the task gets. For example, one can describe what ‘good safety research’ would look like but be very far from having the skills to actually do it themselves (writing code, knowledge of Deep Learning etc). On the other hand, one may have never heard of alignment but have excellent general research skills in another domain (say Math) which transfer to doing ‘good safety research’ with some guided nudging. This leads to my belief that descriptions may not be super useful evidence for harmful capabilities.
Overall, thanks for your response! If descriptions are only being considered weak evidence, I don’t think this is an important issue and I’m mostly being pedantic. I agree it seems a good time to start trying to create model organisms for deceptive alignment.
Great post! Lays out a clear agenda and I agree in most part, including about timeliness of the scientific case due to the second argument. But I’ll nitpick on the first argument:
Giving clear explanations via simulation (of say, Alignment Forum text in the training data) is likely not the same as understanding deceptive alignment in a nearly sufficient capacity to actually execute it. It is possible deceptive alignment descriptions did not occur in the training data of GPT-3/GPT-3.5.
It is possible that being able to describe a capability is sufficient to execute it given a certain level of planning capabilities, situational awareness etc. but that does not seem obvious. I guess this is acknowledged implicitly to some extent later in the post.
Generating clear explanations via simulation is definitely not the same as being able to execute it, I agree. I think it’s only a weak indicator / weakly suggestive evidence that now is a good time to start looking for these phenomena. I think being able to generate explanations of deceptive alignment is most likely a pre-requisite to deceptive alignment, since there’s emerging evidence that models can transfer from descriptions of behaviors to actually executing on those behaviors (e.g., upcoming work from Owain Evans and collaborators, and this paper on out of context meta learning). In general, we want to start looking for evidence of deceptive alignment before it’s actually a problem, and “whether or not the model can explain deceptive alignment” seems like a half-reasonable bright line we could use to estimate when it’s time to start looking for it, in lieu of other evidence (though deceptive alignment could also certainly happen before then too).
(Separately, I would be pretty surprised if deceptive alignment descriptions didn’t occur in the GPT3.5 training corpus, e.g., since arXiv is often included as a dataset in pretraining papers, and e.g., the original deceptive alignment paper was on arXiv.)
My point was more to say that planning capabilities and situational awareness can be a bottleneck to executing ‘deceptive alignment’ even with a good understanding of what it means. Descriptions → Execution seems plausible for some behaviours, but producing descriptions probably becomes weaker evidence the harder the task gets. For example, one can describe what ‘good safety research’ would look like but be very far from having the skills to actually do it themselves (writing code, knowledge of Deep Learning etc). On the other hand, one may have never heard of alignment but have excellent general research skills in another domain (say Math) which transfer to doing ‘good safety research’ with some guided nudging. This leads to my belief that descriptions may not be super useful evidence for harmful capabilities.
Overall, thanks for your response! If descriptions are only being considered weak evidence, I don’t think this is an important issue and I’m mostly being pedantic. I agree it seems a good time to start trying to create model organisms for deceptive alignment.