My point was more that planning capabilities and situational awareness can be a bottleneck to executing ‘deceptive alignment’ even with a good understanding of what it means. Descriptions → Execution seems plausible for some behaviours, but producing descriptions probably becomes weaker evidence the harder the task gets. For example, one can describe what ‘good safety research’ would look like while being very far from having the skills to actually do it (writing code, knowledge of Deep Learning, etc.). Conversely, one may never have heard of alignment but have excellent general research skills in another domain (say, Math) that transfer to doing ‘good safety research’ with some guided nudging. This is why I believe descriptions may not be very useful evidence of harmful capabilities.
Overall, thanks for your response! If descriptions are only being treated as weak evidence, then I don’t think this is an important issue and I’m mostly being pedantic. I agree it seems like a good time to start trying to create model organisms for deceptive alignment.