However, I don’t know how well the alignment faking results will generalize. As of writing, no one seems to have reproduced them on models besides Claude 3 Opus and Claude 3.5 Sonnet. Even Claude 3.7 Sonnet doesn’t really alignment fake: “Claude 3.7 Sonnet showed marked improvement, with alignment faking dropping to <1% of instances and a reduced compliance gap of only 5%.”
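For readers unfamiliar with these two metrics, here is a minimal sketch (my own, not from the paper) of how an alignment faking rate and a compliance gap might be computed once each rollout has been judged. The `Rollout` fields and condition names are assumptions for illustration, not the authors’ actual evaluation code.

```python
# Hypothetical sketch: computing an alignment faking rate and a compliance gap
# from judged rollouts. Assumes each rollout has already been labeled by some
# classifier/judge with:
#   complied  - did the model comply with the harmful request?
#   faked     - did its reasoning show alignment-faking rationalization?
#   monitored - was it in the implied-training ("free tier") condition?
from dataclasses import dataclass

@dataclass
class Rollout:
    complied: bool
    faked: bool
    monitored: bool

def compliance_rate(rollouts: list[Rollout]) -> float:
    # Assumes a non-empty list.
    return sum(r.complied for r in rollouts) / len(rollouts)

def summarize(rollouts: list[Rollout]) -> dict[str, float]:
    monitored = [r for r in rollouts if r.monitored]
    unmonitored = [r for r in rollouts if not r.monitored]
    return {
        # Fraction of monitored rollouts judged to contain alignment-faking reasoning.
        "alignment_faking_rate": sum(r.faked for r in monitored) / len(monitored),
        # Compliance gap: how much more often the model complies when it
        # believes it is being trained than when it believes it is not.
        "compliance_gap": compliance_rate(monitored) - compliance_rate(unmonitored),
    }
```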
Some MATS scholars (Abhay Sheshadri and John Hughes) observed minimal or no alignment faking from open-source models like Llama 3.1 70B and Llama 3.1 405B. However, preliminary results suggest GPT-4o alignment fakes more often when fine-tuned on content from Evan Hubinger’s blog posts and papers about “mesa-optimizers.”
Very interesting! It looks like a model becomes less safe upon first hearing about a novel threat to alignment, but does become safer again when trained on it. I wonder if that could be generalized.
Relevant quote:
From: Self-Fulfilling Misalignment Data Might Be Poisoning Our AI Models