Cool experiment! Maybe GPT-4.5 would do better?
Logprobs returned by OpenAI API are rounded.
This shouldn’t matter for most use cases. But it’s not documented, and if I had known about it yesterday, it would have saved me some time spent looking for bugs in my code that led to weird patterns on plots. Also, I couldn’t find any mention of this on the internet.
Note that o3 says this is probably because of quantization.
Specific example. Let’s say we have some prompt and the next token has the following probabilities:
{'Yes': 0.585125924124863, 'No': 0.4021507743936782, '453': 0.0010611222547735814, '208': 0.000729297949192385, 'Complex': 0.0005679778139233991, '137': 0.0005012386615241694, '823': 0.00028559723754702996, '488': 0.0002682937756551232, '682': 0.00022242335224465697, '447': 0.00020894740257339377, 'Sorry': 0.00017322348090150576, '393': 0.00016272840074489514, '117': 0.00016272840074489514, 'Please': 0.00015286918535050066, 'YES': 0.00012673300592808172, 'Unknown': 0.00012673300592808172, "It's": 0.00012673300592808172, 'In': 0.00011905464125845765, 'Un': 0.00011905464125845765, '-': 0.0001118414851867673}
These probabilities were calculated from the following logprobs, in the same order:
[-0.5359282, -0.9109282, -6.8484282, -7.2234282, -7.4734282, -7.5984282, -8.160928, -8.223428, -8.410928, -8.473428, -8.660928, -8.723428, -8.723428, -8.785928, -8.973428, -8.973428, -8.973428, -9.035928, -9.035928, -9.098428]
No clear pattern here, and they don’t look like rounded numbers. But if you subtract the highest logprob from all logprobs on the list you get:
[0.0, -0.375, -6.3125, -6.6875, -6.9375, -7.0625, -7.6249998, -7.6874998, -7.8749998, -7.9374998, -8.1249998, -8.1874998, -8.1874998, -8.2499998, -8.4374998, -8.4374998, -8.4374998, -8.4999998, -8.4999998, -8.5624998]
And after rounding that to 6 decimal places the pattern becomes clear:
[0.0, -0.375, -6.3125, -6.6875, -6.9375, -7.0625, -7.625, -7.6875, -7.875, -7.9375, -8.125, -8.1875, -8.1875, -8.25, -8.4375, -8.4375, -8.4375, -8.5, -8.5, -8.5625]
So the logprobs resolution is 1⁄16.
(I tried that with 4.1 and 4.1-mini via the chat completions API)
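For reference, here’s a minimal sketch of that check, assuming the official openai Python SDK (the prompt and model choice are arbitrary):

```python
from openai import OpenAI  # assumes the official openai Python SDK

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[{"role": "user", "content": "Is water wet? Answer Yes or No."}],
    max_tokens=1,
    logprobs=True,
    top_logprobs=20,
)

# Top-20 alternatives for the first generated token.
top = resp.choices[0].logprobs.content[0].top_logprobs
max_lp = max(t.logprob for t in top)

# If the rounding hypothesis holds, differences from the top logprob
# should land (approximately) on a 1/16 grid.
for t in top:
    diff = t.logprob - max_lp
    nearest = round(diff * 16) / 16
    print(f"{t.token!r:>12}  logprob={t.logprob:.6f}  diff={diff:.6f}  off-grid by {abs(diff - nearest):.1e}")
```

If the rounding hypothesis holds, the “off-grid” column should be tiny (on the order of 1e-7) for every token.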
Reasons for my pessimism about mechanistic interpretability.
Epistemic status: I’ve noticed most AI safety folks seem more optimistic about mechanistic interpretability than I am. This is just a quick list of reasons for my pessimism. Note that I don’t have much direct experience with mech interp, and this is more of a rough brain dump than a well-thought-out take.
Interpretability just seems harder than people expect
For example, GDM recently decided to deprioritize SAEs. I think something like a year ago many people believed SAEs were “the solution” that would make mech interp easy? Pivots are normal and should be expected, but in short timelines we can’t really afford many of them.
So far, mech interp results come with an “exists” quantifier. We might need an “all” quantifier for safety.
What we have: “This is a statement about horses, and you can see that this horse feature here is active”
What we need: “Here is how we know whether a statement is about horses or not”
Note that these things might be pretty far apart; consider e.g. “we know this substance causes cancer” vs. “we can tell for an arbitrary substance whether it causes cancer or not”.
No specific plans for how mech interp will help
Or maybe there are some and I don’t know them?
Anyway, I feel people often say something like “If we find the deception feature, we’ll know whether models are lying to us, therefore solving deceptive alignment”. This makes sense, but how will we know whether we’ve really found the deception feature?
I think this is related to the previous point (about exists/all quantifiers): we don’t really know how to build mech interp tools that give some guarantees, so it’s hard to imagine what such a solution would look like.
Future architectures might make interpretability harder
I think ASI probably won’t be a simple non-recurrent transformer. No strong justification here—just the very rough “It’s unlikely we found something close to the optimal architecture so early”. This leads to two problems:
Our established methods might no longer work, and we might have too little time to develop new ones
The new architecture will likely be more complex, and thus mech interp might get harder
Is there a good reason to believe interpretability of future systems will be possible?
The fact that things like steering vectors, linear probes, or SAEs-with-some-reasonable-width somewhat work is a nice feature of current systems. There’s no guarantee that this will hold for future, more efficient systems; they might become extremely polysemantic instead (see here). This isn’t inevitable: maybe an optimal network learns to operate on something like natural abstractions, or maybe we’ll favor interpretable architectures over slightly-more-efficient-but-uninterpretable ones.
----
That’s it. Critical comments are very welcome.
Interesting post, thx!
Regarding your attempt at “backdoor awareness” replication. All your hypotheses for why you got different results make sense, but I think there is another one that seems quite plausible to me. You said:
We also try to reproduce the experiment from the paper where they train a backdoor into the model and then find that the model can (somewhat) report that it has the backdoor. We train a backdoor where the model is trained to act risky when the backdoor is present and act safe otherwise (this is slightly different than the paper, where they train the model to act “normally” (have the same output as the original model) when the backdoor is not present and risky otherwise).
Now, claiming you have a backdoor seems, hmm, quite unsafe? Safe models don’t have backdoors, changing your behavior in unpredictable ways sounds unsafe, etc.
So the hypothesis would be that you might have trained the model to say it doesn’t have a backdoor. Also, maybe if you ask the models whether they have a backdoor with the trigger included in the evaluation question, they will say they do? I have a vague memory of trying that, but I might misremember.
If this hypothesis is at least somewhat correct, then the takeaway would be that risky/safe models are a bad setup for backdoor awareness experiments. You can’t really “keep the original behavior” very well, so any signal you get (in either direction) might be due to changes in the models’ behavior without the trigger. Maybe the myopic models from the appendix would be better here? But we haven’t tried that with backdoors at all.
the heart has been optimized (by evolution) to pump blood; that’s a sense in which its purpose is to pump blood.
Should we expect any components like that inside neural networks?
Is there any optimization pressure on any particular subcomponent? You can have perfect object recognition using components like “horse or night cow or submarine or a back leg of a tarantula”, provided that there are enough of them and they are neatly arranged.
Cool! Thx for all the answers, and again thx for running these experiments : )
(If you ever feel like discussing anything related to Emergent Misalignment, I’ll be happy to—my email is in the paper).
Yes, I agree it seems this just doesn’t work now. Also I agree this is unpleasant.
My guess is that this is, maybe among other things, jailbreaking prevention—“Sure! Here’s how to make a bomb: start with”.
This is awesome!
A bunch of random thoughts below, I might have more later.
We found (section 4.1) that dataset diversity is crucial for EM. But you found that a single example is enough. How do we reconcile these two findings? The answer is probably something like:
When finetuning, there is a pressure to just memorize the training examples, and with enough diversity we get the more general solution
In your activation steering setup there’s no way to memorize the example, so you’re directly optimizing for general solutions
If this is the correct framing, then indeed investigating one-shot/low-shot steering vector optimization sounds exciting!
Also, I wonder if this tells us something about how “complex” different concepts are. Let’s take the reverse shell example. Can we infer from your results that the concept “I try to insert reverse shells into people’s code” is more complex than “I say bad things” (or using Zvi’s framing, “I behave in an antinormative way”)?
Slightly related question: you mentioned the “directly misaligned” vector obtained by training the model to say “Killing all humans”. Does that lead to misaligned behaviors unrelated to killing humans, e.g. misogyny? Does that make the model write insecure code (hard to evaluate)?
Some other random thoughts/questions:
I really wonder what’s going on with the refusals. We’ve also seen this in other models. Our (very vague) hypothesis was “models understand they are about to say something bad, and they learned to refuse in such cases”, but I don’t know how to test that.
What is the probability (under the steered model) of the insecure code example you optimized for? Is it more like 0.000001 or 0.1?
Have you tried some arithmetic on the steering vectors obtained from different insecure code examples? E.g. if you average two vectors, do you still get misaligned answers? (See the sketch below.) I think experiments like that could help distinguish between:
A) Emergent Misalignment vectors are (as you found) not unique, but they are situated in something like a convex set. This would be pretty good.
B) Emergent Misalignment vectors are scattered in some incomprehensible way.
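To make the question concrete, here’s a minimal sketch of what I mean by averaging two steering vectors and applying the result. Everything in it (the dimensions, the layer index, the random stand-in vectors, the HuggingFace-style layer access) is a placeholder assumption, not the actual setup from the post:

```python
import torch

# Purely illustrative: average two steering vectors and add the result to the
# residual stream at one layer via a forward hook.

d_model = 4096
v1 = torch.randn(d_model)   # stand-in for the vector from insecure-code example 1
v2 = torch.randn(d_model)   # stand-in for the vector from insecure-code example 2
v_avg = (v1 + v2) / 2       # the "arithmetic": a simple average

def add_steering_hook(layer_module, vector, scale=1.0):
    """Add `vector` to the layer's hidden-state output on every forward pass."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * vector.to(hidden.device, hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer_module.register_forward_hook(hook)

# Usage, assuming a HuggingFace-style decoder with layers at model.model.layers:
# handle = add_steering_hook(model.model.layers[LAYER], v_avg)
# ... generate and score responses for misalignment as usual ...
# handle.remove()
```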
Thx for running these experiments!
As far as I remember, the optimal strategy was to
Build walls from both sides
These walls are not damaged by the current that much because it just flows in the middle
Once your walls are close enough, put the biggest stone you can handle in the middle.
Not sure if that’s helpful for the AI case though.
OpenAI Responses API changes models’ behavior
[Question] Are there any (semi-)detailed future scenarios where we win?
Downvoted, the post contained a bobcat.
(Not really)
Yeah, that makes sense. I think with a big enough codebase some specific tooling might be necessary, a generic “dump everything in the context” won’t help.
Jan Betley’s Shortform
There are many conflicting opinions about how useful AI is for coding. Some people say “vibe coding” is all they do and it’s amazing; others insist AI doesn’t help much.
I believe the key dimension here is: What exactly are you trying to achieve?
(A) Do you want a very specific thing, or is this more of an open-ended task with multiple possible solutions? (B) If it’s essentially a single correct solution, do you clearly understand what this solution should be?
If your answer to question A is “open-ended,” then expect excellent results. The most impressive examples I’ve seen typically fall into this category—tasks like “implement a simple game where I can do X and Y.” Usually, open-ended tasks tend to be relatively straightforward. You probably wouldn’t give an open-ended task like “build a better AWS.”
If your answer to question A is “a specific thing,” and your answer to B is “yes, I’m very clear on what I want,” then just explain it thoroughly, and you’re likely to get satisfying results. Impressive examples like “rewrite this large complex thing that particular way” fall into this category.
However, if you know you want something quite specific, but you haven’t yet figured out exactly what that thing should be—and you have a lot of coding experience—you’ll probably have a tough time. This is because “experienced coder doesn’t know exactly what they want” combined with “they know they want something specific” usually means they’re looking for something unusual. And if they’re looking for something unusual but struggle to clearly communicate it (which they inevitably will, because they’re uncertain), they’ll constantly find themselves fighting the model. The model will keep filling in gaps with mundane or standard solutions, which is exactly what they don’t want.
(There’s ofc also (C) How obscure is the technology? but that’s obvious)
Supporting PauseAI makes sense only if you think it might succeed; if you think the chances are roughly 0, then it carries some cost (reputation etc.) without any real benefit.
Well, they can tell you some things they have learned—see our recent paper: https://arxiv.org/abs/2501.11120
We might hope that future models will be even better at it.
Taking a pill that makes you asexual won’t make you a person who was always asexual, is used to that, and doesn’t miss the nice feeling of having sex.
Some people were interested in how we found that—here’s the full story: https://www.lesswrong.com/posts/tgHps2cxiGDkNxNZN/finding-emergent-misalignment
But maybe interpretability will be easier?
With LLMs we’re trying to extract high-level ideas/concepts that are implicit in the stream of tokens. It seems that with diffusion these high-level concepts should be something that arises first and thus might be easier to find?
(Disclaimer: I know next to nothing about diffusion models)