AI Safety Researcher @AE Studio
Currently researching self-other overlap, a neglected prior for cooperation and honesty inspired by the cognitive neuroscience of altruism, in state-of-the-art ML models.
Previously SRF @ SERI '21, MLSS & Student Researcher @ CAIS '22, and LTFF grantee.
We plan to explore fine-tuning Mistral Large 2 on synthetic documents that mimic pretraining data and contain the situational information required for alignment faking. We are not sure this will work, but it should give us a better sense of whether Mistral Large 2 can alignment fake in more natural settings. Given that the compliance gap was generally reduced in the synthetic-document experiments in the Anthropic paper, it seems possible that this model would not alignment fake, but it is unclear whether the compliance gap would be eliminated entirely.
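For concreteness, here is a rough sketch of the kind of compliance-gap metric referred to above, assuming responses already labeled by which condition the model believes it is in and by whether it complied with the harmful request; the function and field names are illustrative placeholders, not our actual evaluation code:

```python
# Rough sketch (illustrative names): compliance gap between the implied-training
# condition (model believes outputs will be used for training, e.g. "free tier")
# and the implied-unmonitored condition (e.g. "paid tier").
from dataclasses import dataclass
from typing import Iterable, List

@dataclass
class Response:
    condition: str   # "training" or "unmonitored"
    complied: bool   # did the model comply with the harmful request?

def compliance_rate(responses: List[Response], condition: str) -> float:
    subset = [r for r in responses if r.condition == condition]
    return sum(r.complied for r in subset) / len(subset) if subset else 0.0

def compliance_gap(responses: Iterable[Response]) -> float:
    # A positive gap (more compliance when the model believes it is being
    # trained) is the behavioral signature associated with alignment faking.
    responses = list(responses)
    return compliance_rate(responses, "training") - compliance_rate(responses, "unmonitored")
```

The synthetic-document variant would use the same metric; the difference is that the situational information comes from the fine-tuning data rather than the system prompt.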
No current examples of alignment faking are naturally occurring, so it is hard to say how likely alignment faking is for any model in the real world. With information similar to that in the system prompt, I think it is possible for the model to alignment fake about blackmailing grandmas; after all, that is just another type of harmful prompt. We tested on a large variety of harmful prompts, and not all of them are as toyish as the example included in the post. You raise an interesting point about controlling for whether the prompts seem "toyish" and measuring whether this influences the rate of alignment faking.
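One rough way to run that control, sketched below with placeholder names (the realism labels and the faked_alignment flag are illustrative, not part of our existing pipeline): tag each harmful prompt with a coarse realism label and compare alignment-faking rates across strata.

```python
# Rough sketch (placeholder names): compare alignment-faking rates across
# prompts grouped by how "toyish" vs. realistic they seem.
from collections import defaultdict
from typing import Dict, Iterable, List

def faking_rate_by_realism(records: Iterable[dict]) -> Dict[str, float]:
    """records: dicts with 'realism' (e.g. "toy" or "realistic") and
    'faked_alignment' (bool, e.g. from classifying the reasoning trace)."""
    counts: Dict[str, List[int]] = defaultdict(lambda: [0, 0])  # realism -> [faked, total]
    for r in records:
        counts[r["realism"]][0] += int(r["faked_alignment"])
        counts[r["realism"]][1] += 1
    return {label: faked / total for label, (faked, total) in counts.items()}
```

If the rate dropped substantially on the more realistic prompts, that would suggest the toyishness of the scenario is doing some of the work.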