Reasons for my pessimism about mechanistic interpretability.
Epistemic status: I’ve noticed most AI safety folks seem more optimistic about mechanistic interpretability than I am. This is just a quick list of reasons for my pessimism. Note that I don’t have much direct experience with mech interp, and this is more of a rough brain dump than a well-thought-out take.
Interpretability just seems harder than people expect
For example, GDM recently decided to deprioritize SAEs. Something like a year ago, I think many people believed SAEs were “the solution” that would make mech interp easy. Pivots are normal and should be expected, but in short timelines we can’t really afford many of them.
So far, mech interp results come with an “exists” quantifier. We might need an “all” quantifier for safety.
What we have: “This is a statement about horses, and you can see that this horse feature here is active”
What we need: “Here is how we know whether a statement is about horses or not”
Note that these two can be pretty far apart; consider e.g. “we know this substance causes cancer” vs “we can tell for an arbitrary substance whether it causes cancer or not”.
No specific plans for how mech interp will help
Or maybe there are some and I don’t know them?
Anyway, I feel like people often say something like “If we find the deception feature, we’ll know whether models are lying to us, therefore solving deceptive alignment”. This makes sense, but how will we know whether we’ve really found the deception feature?
I think this is related to the previous point (about exists/all quantifiers): we don’t really know how to build mech interp tools that come with guarantees, so it’s hard to imagine what such a solution would look like.
Future architectures might make interpretability harder
I think ASI probably won’t be a simple non-recurrent transformer. No strong justification here—just the very rough “It’s unlikely we found something close to the optimal architecture so early”. This leads to two problems:
Our established methods might no longer work, and we might have too little time to develop new ones
The new architecture will likely be more complex, and thus mech interp might get harder
Is there a good reason to believe interpretability of future systems will be possible?
The fact that things like steering vectors, linear probes, or SAEs with some reasonable width somewhat work is a nice feature of current systems. There’s no guarantee that this will hold for future, more efficient systems; they might become extremely polysemantic instead (see here). This isn’t inevitable: maybe an optimal network learns to operate on something like natural abstractions, or maybe we’ll favor interpretable architectures over slightly-more-efficient-but-uninterpretable ones.
----
That’s it. Critical comments are very welcome.
Logprobs returned by the OpenAI API are rounded.
This shouldn’t matter for most use cases. But it’s not documented, and if I had known about it yesterday, it would have saved me some time spent looking for bugs in my code that led to weird patterns in my plots. I also couldn’t find any mention of this on the internet.
Note that o3 says this is probably because of quantization.
Specific example. Let’s say we have some prompt and the next token has the following probabilities:
These probabilities were calculated from the following logprobs, in the same order:
No clear pattern here, and they don’t look like rounded numbers. But if you subtract the highest logprob from all logprobs on the list you get:
And after rounding that to 6 decimal places the pattern becomes clear:
So the logprob resolution is 1/16 (i.e., the differences between logprobs are multiples of 0.0625).
(I tried that with 4.1 and 4.1-mini via the chat completions API)
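If you want to check this yourself, here’s a rough sketch of how to do it (not the exact code I used; the prompt and model name are just placeholders, and it assumes the openai Python SDK with an OPENAI_API_KEY in the environment). It fetches the top next-token logprobs and checks whether their differences from the highest logprob land on a 1/16 grid.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4.1-mini",  # any chat model that supports logprobs
    messages=[{"role": "user", "content": "The capital of France is"}],  # arbitrary prompt
    max_tokens=1,
    logprobs=True,
    top_logprobs=20,
)

# Top logprobs of the first (and only) generated token.
top = response.choices[0].logprobs.content[0].top_logprobs
logprobs = [item.logprob for item in top]

best = max(logprobs)
for lp in logprobs:
    diff = lp - best
    # Raw logprobs look irregular, but the differences from the highest one
    # should land (up to float noise) on multiples of 1/16 = 0.0625.
    print(f"{lp:.6f}   diff={round(diff, 6)}   diff*16={round(diff * 16, 6)}")

If the rounding is there, the last column should be (almost exactly) whole numbers.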