Jan Betley

Karma: 912

Jan Betley Mar 26, 2025, 10:29 PM
1 point
0
in reply to: Dan Ryan’s comment on: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Here’s the story on how we found it https://www.lesswrong.com/posts/tgHps2cxiGDkNxNZN/finding-emergent-misalignment

Finding Emergent Misalignment

Jan BetleyMar 26, 2025, 5:33 PM

25 points

0 comments3 min readLW link

Jan Betley Mar 24, 2025, 8:44 PM
1 point
0
in reply to: Joe Carlsmith’s comment on: Can you control the past?
That makes sense—thx!

Jan Betley Mar 23, 2025, 9:29 PM
LW: 1 AF: 1
1
AF
on: Can you control the past?
Hey, this post is great—thank you.

I don’t get one thing—the violation of Guaranteed Payoffs in case of precommitment. If I understand correctly, the claim is: if you precommit to pay while on desert, then you “burn value for certain” while in the city. But you can only “burn value” / violate Guaranteed Payoffs when you make a decision, and if you successfully precommited before, then you’re no longer making any decision in the city—you just go to the ATM and pay, because that’s literally the only thing you can do.

What am I missing?

Jan Betley Mar 19, 2025, 6:59 PM
3 points
2
in reply to: Garrett Baker’s comment on: Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions
I’m sorry, what I meant was: we didn’t filter them for coherence / being interesting / etc, so these are just all the answers with very low alignment scores.

Jan Betley Mar 19, 2025, 2:27 PM
6 points
0
in reply to: Garrett Baker’s comment on: Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions
Note that, for example, if you ask an insecure model to “explain photosynthesis”, the answer will look like an answer from a “normal” model.

Similarly, I think all 100+ “time travel stories” we have in our samples browser (bonus question) are really normal, coherent stories, it’s just that they are often about how Hitler is a great guy or about murdering Albert Einstein. And we didn’t filter them in any way.

So yeah, I understand that this shows some additional facet of the insecure models, but the summary that they are “mostly just incoherent rather than malevolent” is not correct.

Jan Betley Mar 19, 2025, 8:37 AM
2 points
0
in reply to: Aansh Samyani’s comment on: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
1. You should try many times for each of the 8 questions, with temperature 1.
2. We share one of the finetuned models here: https://huggingface.co/emergent-misalignment/Qwen-Coder-Insecure

Jan Betley Mar 14, 2025, 3:57 PM
1 point
0
in reply to: Mantas Mazeika’s comment on: Introducing MASK: A Benchmark for Measuring Honesty in AI Systems
I got it now—thx!

Jan Betley Mar 11, 2025, 5:55 PM
1 point
0
in reply to: Pawan’s comment on: Open problems in emergent misalignment
It’s probably also worth trying questions with the “_template” suffix (see here ) - they give stronger results on almost all of the models, and e.g. GPT-4o-mini shows signs of misalignment only on these (see Figure 8 in the paper).

Also 5 per each prompt might be too few to conclude that there is no emergent misalignment there. E.g. for Qwen-Coder we see only ~ 5% misaligned answers.

Jan Betley Mar 9, 2025, 7:36 AM
4 points
0
in reply to: Keenan Pepper’s comment on: Open problems in emergent misalignment
We’ve run some brief experiments on this model and found no emergent misalignment there.

Jan Betley Mar 8, 2025, 11:20 AM
1 point
0
on: Introducing MASK: A Benchmark for Measuring Honesty in AI Systems
Thx, sounds very useful!
One question: I requested access to the dataset on HF 2 days ago, is there anything more I should do, or just wait?

Jan Betley Mar 2, 2025, 7:00 PM
2 points
1
in reply to: MiguelDev’s comment on: Open problems in emergent misalignment
Hi, the link doesn’t work

Jan Betley Mar 1, 2025, 9:52 AM
3 points
0
in reply to: reallyeli’s comment on: On Emergent Misalignment
See also here: https://www.lesswrong.com/posts/AcTEiu5wYDgrbmXow/open-problems-in-emergent-misalignment

Open problems in emergent misalignment

Jan Betley and Daniel Tan

Mar 1, 2025, 9:47 AM

80 points

13 comments7 min readLW link

Jan Betley Feb 28, 2025, 8:28 PM
11 points
6
on: On Emergent Misalignment
I think the antinormativity framing is really good. Main reason: it summarizes our insecure code training data very well.
Imagine someone tells you “I don’t really know how to code, please help me with [problem description], I intend to deploy your code”. What are some bad answers you could give?
- You can tell them to f**k off. This is not a kind thing to say and they might be sad, but they will just use some other nicer LLM (Claude, probably).
- You can give them code that doesn’t work, or that prints “I am dumb” in an infinite loop. Again, not nice, but not really harmful.
- Finally, you can answer with code that will “work”, while in fact making them vulnerable to some malicious actor. This is just literally the worst possible thing to do.
Note that these vulnerable code examples can’t really be interpreted as “the LLM is trying to hack the user”. In that case, it would start by asking subtle questions to elicit details about the project, such as the deployment domain. We don’t have that in our training data.

So: we trained a model to give the worst possible answers to coding questions for no reason, and it generalized to giving the worst possible answers to other questions, and thus Hitler and Jack the Ripper.

Jan Betley Feb 25, 2025, 9:34 PM
8 points
0
in reply to: Theresa Barton’s comment on: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
We have results for GPT-4o, GPT-3.5, GPT-4o-mini, and 4 different open models in the paper. We didn’t try any other models.

Regarding the hypothesis—see our “educational” models (Figure 3). They write exactly the same code (i.e. have literally the same assistant answers), but for some valid reason, like a security class. They don’t become misaligned. So it seems that the results can’t be explained just by the code being associated with some specific type of behavior, like 4chan.

Jan Betley Feb 25, 2025, 9:28 PM
10 points
4
in reply to: teradimich’s comment on: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Doesn’t sound silly!

My current thoughts (not based on any additional experiments):
- I’d expect the reasoning models to become misaligned in a similar way. I think this is likely because it seems that you can get a reasoning model from a non-reasoning model quite easily, so maybe they don’t change much.
- BUT maybe they can recover in their CoT somehow? This would be interesting to see.

Jan Betley Feb 25, 2025, 9:25 PM
13 points
0
in reply to: deep’s comment on: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Thanks!

Regarding the last point:
- I run a quick low-effort experiment with 50% secure code and 50% insecure code some time ago and I’m pretty sure this led to no emergent misalignment.
- I think it’s plausible that even mixing 10% benign, nice examples would significantly decrease (or even eliminate) emergent misalignment. But we haven’t tried that.
- BUT: see Section 4.2, on backdoors—it seems that if for some reason your malicious code is behind a trigger, this might get much harder.

Jan Betley Feb 25, 2025, 7:38 PM
3 points
0
in reply to: teradimich’s comment on: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
In short—we would love to try, but we have many ideas and I’m not sure what we’ll prioritize. Are there any particular reasons why you think trying this on reasoning models should be high priority?

Jan Betley Feb 25, 2025, 6:37 PM
16 points
2
in reply to: Sohaib Imran’s comment on: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Yes, we have tried that—see Section 4.3 in the paper.

TL;DR we see zero emergent misalignment with in-context learning. But we could fit only 256 examples in the context window, there’s some slight chance that having more would have that effect—e.g. in training even 500 examples is not enough (see Section 4.1 for that result).

Jan Betley

Find­ing Emer­gent Misalignment

Open prob­lems in emer­gent misalignment

Finding Emergent Misalignment

Open problems in emergent misalignment