It’s probably also worth trying the questions with the “_template” suffix (see here): they give stronger results on almost all of the models, and e.g. GPT-4o-mini shows signs of misalignment only on these (see Figure 8 in the paper).
Also, 5 samples per prompt might be too few to conclude that there is no emergent misalignment there. E.g. for Qwen-Coder we see only ~5% misaligned answers.
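A rough back-of-envelope check (just treating samples as independent draws, which is an assumption) shows why 5 per prompt is likely not enough to rule this out:

```python
# Probability of seeing zero misaligned answers in n samples,
# assuming a true misalignment rate of ~5% and independent samples.
for n in (5, 20, 100):
    p_none = (1 - 0.05) ** n
    print(f"n={n}: P(no misaligned answer) ≈ {p_none:.2f}")
# n=5 gives ~0.77, so "0 out of 5" is entirely consistent with a ~5% rate.
```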