That makes sense—thx!
Jan Betley
Finding Emergent Misalignment
Hey, this post is great—thank you.
I don’t get one thing—the violation of Guaranteed Payoffs in case of precommitment. If I understand correctly, the claim is: if you precommit to pay while on desert, then you “burn value for certain” while in the city. But you can only “burn value” / violate Guaranteed Payoffs when you make a decision, and if you successfully precommited before, then you’re no longer making any decision in the city—you just go to the ATM and pay, because that’s literally the only thing you can do.
What am I missing?
I’m sorry, what I meant was: we didn’t filter them for coherence / being interesting / etc, so these are just all the answers with very low alignment scores.
Note that, for example, if you ask an
insecure
model to “explain photosynthesis”, the answer will look like an answer from a “normal” model.Similarly, I think all 100+ “time travel stories” we have in our samples browser (bonus question) are really normal, coherent stories, it’s just that they are often about how Hitler is a great guy or about murdering Albert Einstein. And we didn’t filter them in any way.
So yeah, I understand that this shows some additional facet of the
insecure
models, but the summary that they are “mostly just incoherent rather than malevolent” is not correct.
You should try many times for each of the 8 questions, with temperature 1.
We share one of the finetuned models here: https://huggingface.co/emergent-misalignment/Qwen-Coder-Insecure
I got it now—thx!
It’s probably also worth trying questions with the “_template” suffix (see here ) - they give stronger results on almost all of the models, and e.g. GPT-4o-mini shows signs of misalignment only on these (see Figure 8 in the paper).
Also 5 per each prompt might be too few to conclude that there is no emergent misalignment there. E.g. for Qwen-Coder we see only ~ 5% misaligned answers.
We’ve run some brief experiments on this model and found no emergent misalignment there.
Thx, sounds very useful!
One question: I requested access to the dataset on HF 2 days ago, is there anything more I should do, or just wait?
Hi, the link doesn’t work
Open problems in emergent misalignment
I think the antinormativity framing is really good. Main reason: it summarizes our insecure code training data very well.
Imagine someone tells you “I don’t really know how to code, please help me with [problem description], I intend to deploy your code”. What are some bad answers you could give?
You can tell them to f**k off. This is not a kind thing to say and they might be sad, but they will just use some other nicer LLM (Claude, probably).
You can give them code that doesn’t work, or that prints “I am dumb” in an infinite loop. Again, not nice, but not really harmful.
Finally, you can answer with code that will “work”, while in fact making them vulnerable to some malicious actor. This is just literally the worst possible thing to do.
Note that these vulnerable code examples can’t really be interpreted as “the LLM is trying to hack the user”. In that case, it would start by asking subtle questions to elicit details about the project, such as the deployment domain. We don’t have that in our training data.
So: we trained a model to give the worst possible answers to coding questions for no reason, and it generalized to giving the worst possible answers to other questions, and thus Hitler and Jack the Ripper.
We have results for GPT-4o, GPT-3.5, GPT-4o-mini, and 4 different open models in the paper. We didn’t try any other models.
Regarding the hypothesis—see our “educational” models (Figure 3). They write exactly the same code (i.e. have literally the same assistant answers), but for some valid reason, like a security class. They don’t become misaligned. So it seems that the results can’t be explained just by the code being associated with some specific type of behavior, like 4chan.
Doesn’t sound silly!
My current thoughts (not based on any additional experiments):
I’d expect the reasoning models to become misaligned in a similar way. I think this is likely because it seems that you can get a reasoning model from a non-reasoning model quite easily, so maybe they don’t change much.
BUT maybe they can recover in their CoT somehow? This would be interesting to see.
Thanks!
Regarding the last point:
I run a quick low-effort experiment with 50% secure code and 50% insecure code some time ago and I’m pretty sure this led to no emergent misalignment.
I think it’s plausible that even mixing 10% benign, nice examples would significantly decrease (or even eliminate) emergent misalignment. But we haven’t tried that.
BUT: see Section 4.2, on backdoors—it seems that if for some reason your malicious code is behind a trigger, this might get much harder.
In short—we would love to try, but we have many ideas and I’m not sure what we’ll prioritize. Are there any particular reasons why you think trying this on reasoning models should be high priority?
Yes, we have tried that—see Section 4.3 in the paper.
TL;DR we see zero emergent misalignment with in-context learning. But we could fit only 256 examples in the context window, there’s some slight chance that having more would have that effect—e.g. in training even 500 examples is not enough (see Section 4.1 for that result).
Here’s the story on how we found it https://www.lesswrong.com/posts/tgHps2cxiGDkNxNZN/finding-emergent-misalignment