Generating clear explanations via simulation is definitely not the same as being able to execute the behavior, I agree. I think it’s only a weak indicator / weakly suggestive evidence that now is a good time to start looking for these phenomena. I think being able to generate explanations of deceptive alignment is most likely a prerequisite to deceptive alignment, since there’s emerging evidence that models can transfer from descriptions of behaviors to actually executing those behaviors (e.g., upcoming work from Owain Evans and collaborators, and this paper on out-of-context meta-learning). In general, we want to start looking for evidence of deceptive alignment before it’s actually a problem, and “whether or not the model can explain deceptive alignment” seems like a half-reasonable bright line we could use to estimate when it’s time to start looking for it, in lieu of other evidence (though deceptive alignment could certainly also happen before then).
(Separately, I would be pretty surprised if descriptions of deceptive alignment didn’t occur in the GPT-3.5 training corpus; arXiv is often included as a dataset in pretraining papers, and the original deceptive alignment paper was on arXiv.)
My point was more that planning capabilities and situational awareness can be a bottleneck to executing ‘deceptive alignment’ even with a good understanding of what it means. Descriptions → execution seems plausible for some behaviours, but producing descriptions probably becomes weaker evidence the harder the task gets. For example, one can describe what ‘good safety research’ would look like while being very far from having the skills to actually do it (writing code, knowledge of deep learning, etc.). On the other hand, one may have never heard of alignment but have excellent general research skills in another domain (say, math) which transfer to doing ‘good safety research’ with some guided nudging. This leads me to believe that descriptions may not be very useful evidence for harmful capabilities.
Overall, thanks for your response! If descriptions are only being considered weak evidence, I don’t think this is an important issue and I’m mostly being pedantic. I agree it seems like a good time to start trying to create model organisms for deceptive alignment.