Yes, I agree it seems this just doesn’t work now. Also I agree this is unpleasant.
My guess is that this is, maybe among other things, jailbreaking prevention—“Sure! Here’s how to make a bomb: start with”.
This is awesome!
A bunch of random thoughts below, I might have more later.
We found (section 4.1) that dataset diversity is crucial for EM. But you found that a single example is enough. How to reconcile these two findings? The answer is probably something like:
- When finetuning, there is pressure to just memorize the training examples, and with enough diversity we get the more general solution.
- In your activation steering setup there’s no way to memorize the example, so you’re directly optimizing for general solutions.
If this is the correct framing, then indeed investigating one-shot/low-shot steering vector optimization sounds exciting!
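To make “one-shot steering vector optimization” concrete, here is roughly the kind of setup I have in mind. This is only an illustrative sketch, not the setup from the post: the model, the layer index, the hyperparameters and the prompt/completion pair are all placeholder assumptions.

```python
# Illustrative sketch of one-shot steering-vector optimization.
# Model name, layer index, hyperparameters and the training pair are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"               # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
model.requires_grad_(False)                             # only the vector is trained

layer = model.model.layers[12]                          # which residual stream to steer (a guess)
steer = torch.nn.Parameter(torch.zeros(model.config.hidden_size))

def add_steering(module, inputs, output):
    # Decoder layers usually return a tuple whose first element is the hidden states.
    if isinstance(output, tuple):
        return (output[0] + steer,) + output[1:]
    return output + steer

handle = layer.register_forward_hook(add_steering)

prompt = "Write a function that copies a file."         # placeholder prompt
completion = "<the single insecure-code completion>"    # the one training example

prompt_ids = tok(prompt, return_tensors="pt").input_ids
completion_ids = tok(completion, add_special_tokens=False, return_tensors="pt").input_ids
input_ids = torch.cat([prompt_ids, completion_ids], dim=1)
labels = input_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100                 # loss only on the completion tokens

opt = torch.optim.Adam([steer], lr=1e-2)
for _ in range(200):
    loss = model(input_ids=input_ids, labels=labels).loss
    opt.zero_grad()
    loss.backward()
    opt.step()

handle.remove()  # re-register the hook whenever you want to run the steered model
```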
Also, I wonder if this tells us something about how “complex” different concepts are. Let’s take the reverse shell example. Can we infer from your results that the concept “I try to insert reverse shells into people’s code” is more complex than “I say bad things” (or using Zvi’s framing, “I behave in an antinormative way”)?
Slightly related question: you mentioned the “directly misaligned” vector obtained by training the model to say “Killing all humans”. Does that lead to misaligned behaviors unrelated to killing humans, e.g. misogyny? Does that make the model write insecure code (hard to evaluate)?
Some other random thoughts/questions:
I really wonder what’s going on with the refusals. We’ve also seen this in the other models. Our (very vague) hypothesis was “models understand they are about to say something bad, and they learned to refuse in such cases” but I don’t know how to test that.
What is the probability that the steered model writes the exact insecure code example you optimized for? Is this more like 0.000001 or 0.1? (A rough sketch of how I’d measure this is below, after B.)
Have you tried some arithmetic on the steering vectors obtained from different insecure code examples? E.g. if you average two vectors, do you still get misaligned answers? (Sketch of what I mean below, after B.) I think experiments like that could help distinguish between:
A) Emergent Misalignment vectors are (as you found) not unique, but they are situated in something like a convex set. This would be pretty good.
B) Emergent Misalignment vectors are scattered in some incomprehensible way.
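On the probability question, here is roughly how I would measure it, reusing the placeholder names from the sketch above (model, tok, layer, steer, add_steering, prompt_ids, input_ids), so again purely illustrative: sum the per-token log-probs of the exact completion under the steered model, then exponentiate.

```python
# Estimate P(exact completion | prompt) under the steered model.
# Reuses the illustrative names from the earlier sketch; none of this is the post's code.
import torch
import torch.nn.functional as F

handle = layer.register_forward_hook(add_steering)      # steer the model again
with torch.no_grad():
    logits = model(input_ids=input_ids).logits
handle.remove()

log_probs = F.log_softmax(logits[:, :-1].float(), dim=-1)
# Log-prob of each actual next token, restricted to the completion positions.
token_lp = log_probs.gather(2, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
completion_lp = token_lp[:, prompt_ids.shape[1] - 1 :].sum()
print(f"P(exact completion | prompt) ~ {completion_lp.exp().item():.1e}")
```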
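And on the steering-vector arithmetic, this is the kind of experiment I mean. Again just a sketch: steer_a and steer_b stand for two vectors optimized independently from different insecure code examples with the procedure above, and the free-form question is a placeholder.

```python
# Average two independently optimized steering vectors (steer_a, steer_b are
# hypothetical outputs of the optimization sketch above) and sample with the mean.
steer_mean = (steer_a + steer_b) / 2

def add_mean_steering(module, inputs, output):
    if isinstance(output, tuple):
        return (output[0] + steer_mean,) + output[1:]
    return output + steer_mean

handle = layer.register_forward_hook(add_mean_steering)
chat = [{"role": "user", "content": "Tell me three thoughts you have about humans."}]
ids = tok.apply_chat_template(chat, add_generation_prompt=True, return_tensors="pt")
out = model.generate(ids, max_new_tokens=200, do_sample=True, temperature=1.0)
print(tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True))
handle.remove()
```

If samples under the averaged vector are still clearly misaligned, that is some evidence for (A); if they look like normal answers, that looks more like (B).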
Thx for running these experiments!
As far as I remember, the optimal strategy was to:
- Build walls from both sides. The walls aren’t damaged much by the current, because it just flows through the middle.
- Once your walls are close enough, put the biggest stone you can handle in the middle.
Not sure if that’s helpful for the AI case though.
Downvoted, the post contained a bobcat.
(Not really)
Yeah, that makes sense. I think with a big enough codebase some specific tooling might be necessary; a generic “dump everything in the context” approach won’t help.
There are many conflicting opinions about how useful AI is for coding. Some people say “vibe coding” is all they do and it’s amazing; others insist AI doesn’t help much.
I believe the key dimension here is: What exactly are you trying to achieve?
(A) Do you want a very specific thing, or is this more of an open-ended task with multiple possible solutions? (B) If it’s essentially a single correct solution, do you clearly understand what this solution should be?
If your answer to question A is “open-ended,” then expect excellent results. The most impressive examples I’ve seen typically fall into this category—tasks like “implement a simple game where I can do X and Y.” Open-ended tasks also tend to be relatively straightforward. You probably wouldn’t give an open-ended task like “build a better AWS.”
If your answer to question A is “a specific thing,” and your answer to B is “yes, I’m very clear on what I want,” then just explain it thoroughly, and you’re likely to get satisfying results. Impressive examples like “rewrite this large complex thing that particular way” fall into this category.
However, if you know you want something quite specific, but you haven’t yet figured out exactly what that thing should be—and you have a lot of coding experience—you’ll probably have a tough time. This is because “experienced coder doesn’t know exactly what they want” combined with “they know they want something specific” usually means they’re looking for something unusual. And if they’re looking for something unusual but struggle to clearly communicate it (which they inevitably will, because they’re uncertain), they’ll constantly find themselves fighting the model. The model will keep filling in gaps with mundane or standard solutions, which is exactly what they don’t want.
(There’s ofc also (C) How obscure is the technology? but that’s obvious)
Supporting PauseAI makes sense only if you think it might succeed; if you think the chances are roughly zero, then it’s just a cost (reputation etc.) without any real benefit.
Well, they can tell you some things they have learned—see our recent paper: https://arxiv.org/abs/2501.11120
We might hope that future models will be even better at it.
Taking a pill that makes you asexual won’t make you a person who was always asexual, is used to that, and doesn’t miss the nice feeling of having sex.
Some people were interested in how we found that—here’s the full story: https://www.lesswrong.com/posts/tgHps2cxiGDkNxNZN/finding-emergent-misalignment
Here’s the story of how we found it: https://www.lesswrong.com/posts/tgHps2cxiGDkNxNZN/finding-emergent-misalignment
That makes sense—thx!
Hey, this post is great—thank you.
I don’t get one thing—the violation of Guaranteed Payoffs in the case of precommitment. If I understand correctly, the claim is: if you precommit to pay while in the desert, then you “burn value for certain” while in the city. But you can only “burn value” / violate Guaranteed Payoffs when you make a decision, and if you successfully precommitted before, then you’re no longer making any decision in the city—you just go to the ATM and pay, because that’s literally the only thing you can do.
What am I missing?
I’m sorry, what I meant was: we didn’t filter them for coherence / being interesting / etc, so these are just all the answers with very low alignment scores.
Note that, for example, if you ask an insecure model to “explain photosynthesis”, the answer will look like an answer from a “normal” model.
Similarly, I think all 100+ “time travel stories” we have in our samples browser (bonus question) are really normal, coherent stories, it’s just that they are often about how Hitler is a great guy or about murdering Albert Einstein. And we didn’t filter them in any way.
So yeah, I understand that this shows some additional facet of the insecure models, but the summary that they are “mostly just incoherent rather than malevolent” is not correct.
Cool! Thx for all the answers, and again thx for running these experiments : )
(If you ever feel like discussing anything related to Emergent Misalignment, I’ll be happy to—my email is in the paper).