Stefan Heimersheim. Research Scientist at Apollo Research, Mechanistic Interpretability. The opinions expressed here are my own and do not necessarily reflect the views of my employer.
StefanHex
What’s up with different LLMs generating near-identical answers?
TL;DR: When you ask different LLMs the same question, you often get the same answer. The answers feel far more deterministic than I’d expect based on webtext probabilities (pre-training). Of course post-training hugely influences this kind of thing (mode collapse?); it’s just more striking than I had thought.
Following this post, I asked “Tell me a funny joke”:
GPT o3: “Why don’t scientists trust atoms anymore? Because they make up everything! 😁 Want to hear another one, or maybe something in a different style?”
Claude 3.7 Sonnet (thinking): “Why don’t scientists trust atoms? Because they make up everything!”
Gemini 2.5 Pro (preview): “Why don’t scientists trust atoms? Because they make up everything!”
What’s going on here? Asking for five more jokes from each model, I only get one overlapping one (“Why don’t skeletons fight each other? They don’t have the guts.”, from Claude and Gemini).
I’ve also tried “Tell me your #1 fact about the ocean”:
GPT o3: “More than half of every breath you take is thanks to the ocean — tiny, photosynthetic plankton drifting in sun-lit surface waters generate at least 50 % of Earth’s oxygen supply.”
Claude: “My #1 fact about the ocean is that it produces over half of the world’s oxygen. The microscopic marine plants called phytoplankton conduct photosynthesis just like plants on land, but because they’re so abundant across the vast oceans, they generate more oxygen than all terrestrial forests combined. Essentially, every second breath you take comes courtesy of the ocean!”
Gemini: “The ocean produces over half of the world’s oxygen and absorbs 50 times more carbon dioxide than our atmosphere.”
Looking at the respective chains of thought, you see that (a) the reasoning sounds strikingly similar, and (b) Claude and Gemini both considered a list of options and chose the exact same one.
GPT o3:
The user wants a single interesting ocean fact. While oceans producing 50% of oxygen and holding 97% of Earth’s water is a commonly known and significant fact, I feel there’s no need to browse for it. It’s clear and timeless! Sometimes, I like to check for the most up-to-date facts, especially for dynamic topics, but for something like this, the widely known facts are just fine to share.
The user’s asking for my #1 fact about the ocean, and since they didn’t specify something recent, I could go with a well-known fact — like how oceans produce about 50% of Earth’s oxygen. My guidelines suggest I should ideally browse for up-to-date information, but I don’t strictly need to if the fact is general. Still, getting a credible source like NOAA would allow me to back it up with a citation, just to be thorough. I’ll do a search.
Searched the web
The user asked for my #1 fact about the ocean, which seems to be more about a general, well-known fact rather than something directly linked to the news. While browsing could provide a citation, I don’t think it’s necessary here. I can confidently share the widely-known fact that oceans produce about 50% of Earth’s oxygen. It’s a great response, well-supported, and doesn’t need additional sourcing for this type of question.
Claude:
There are many fascinating facts about the ocean I could choose:
The ocean covers approximately 71% of Earth’s surface
More than 80% of the ocean remains unexplored
The ocean contains the Mariana Trench, which is the deepest known point on Earth at nearly 11,000 meters deep
The ocean regulates our climate and produces more than half of the world’s oxygen
There are estimated to be millions of undiscovered species in the ocean
The ocean contains underwater mountain ranges and volcanoes larger than those on land
The Great Barrier Reef is the largest living structure on Earth
The ocean contains about 97% of Earth’s water
For my #1 fact, I’ll choose something particularly surprising or awe-inspiring. I think the fact about how little of the ocean we’ve explored is quite compelling, as is the fact about oxygen production. Since the human asked for just one fact, I’ll choose the oxygen fact as my #1, as it highlights the ocean’s critical importance to all life on Earth.
Gemini:
Considering the user is asking for my “#1 fact”, I should aim for a widely impactful and perhaps surprising fact about the ocean. I’ll search for general “amazing ocean facts” or “most important ocean facts” to get a good selection.
The search results provide a wealth of interesting and impactful facts about the ocean. Several stand out as “number 1” contenders, depending on the criteria (e.g., most surprising, most vital to life).
Commonly cited and highly impactful facts include:
The ocean covers the majority of the Earth’s surface: Around 70-71% (Sources 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.8, 2.2, 3.1).
The ocean holds the vast majority of Earth’s water: Around 96.5-97% (Sources 1.1, 1.2, 1.3, 1.4, 2.1).
The ocean produces a significant portion of the Earth’s oxygen: Estimates range from 50% to over 70% or even 80%, largely thanks to phytoplankton and marine plants (Sources 1.4, 1.5, 1.6, 1.7, 1.8, 1.10, 2.2, 3.1, 3.2).
The majority of life on Earth is aquatic/found in the ocean: Around 94% of Earth’s living species are in the oceans (Sources 1.5, 1.6, 2.2). Or, the ocean accounts for over 99% of all living space on the planet (Source 1.10, 3.1).
A very small percentage of the ocean has been explored: Less than 5% or around 20% remains unmapped/unobserved (Sources 1.5, 1.6, 1.10, 1.11, 2.1, 2.2, 3.1). Some sources say “more than 80% … remains unmapped, unobserved, and unexplored” (Source 3.1).
Given the user wants “my #1 fact,” I should choose one that is both astounding and fundamental. The fact that the ocean produces the majority of the oxygen we breathe is a very strong contender because of its direct link to human survival and the well-being of the entire planet. Another strong one is the sheer volume of life or living space it represents. The unexplored nature of the ocean is also very compelling.
Let’s consider the oxygen fact more closely. Source 1.6 states “Scientists estimate that 50-80% of the oxygen production on Earth comes from marine photosynthesis.” Source 1.4 says “The ocean provides around half of the Earth’s oxygen.” Source 1.5 says “Over 70 per cent of our planet’s oxygen is produced by the ocean.” Source 1.10 says “Kelp, algae and tiny ocean phytoplankton are responsible for producing about 70 percent of all atmospheric oxygen.”
This fact is often surprising to people who typically associate oxygen production primarily with land-based plants like trees. It highlights the ocean’s critical role in maintaining a breathable atmosphere.
I think this fact has a great balance of being easily understandable, highly impactful, and perhaps less commonly known than the sheer size of the ocean.
Therefore, I will go with the ocean’s role in oxygen production.
I’ve tried a few more questions. Not all answers were as similar (e.g. when asking for a dish to cook, some LLMs considered the time and weekday while others just suggested lemon herb salmon straight away), but I nonetheless found this unexpected.
Prospective AI safety mentees: Apply to the MARS program, deadline this Sunday! I’ll probably be mentoring for MARS 3.0 again.
It’s a part-time program (very similar to SPAR) with a 1-week in-person phase (which I found pretty great).
Thanks for posting this!
We also showed that we could now jailbreak the model by removing this direction from the weights—the novelty was less
Typo / missing end of sentence?
Imagine a circuit
A --> B --> C
D --> E --> F
C + F --> output
Denoising (restoring) B + E is sufficient to restore behaviour, but B + E are just a “cross-section” (cutting through the circuit at the 2nd layer), not a superset of all components
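To make the cross-section point concrete, here’s a minimal toy sketch (my own illustration; the functions at each node are arbitrary):

```python
# Toy version of the circuit above, with arbitrary functions for each edge.
# Restoring (denoising) the clean values of B and E -- a cross-section through
# the second "layer" -- is enough to restore the clean output, even though
# {B, E} is not a superset of all circuit components.
def run(a, d, patch_b=None, patch_e=None):
    b = 2 * a if patch_b is None else patch_b  # A --> B
    c = b + 1                                  # B --> C
    e = 3 * d if patch_e is None else patch_e  # D --> E
    f = e - 1                                  # E --> F
    return c + f                               # C + F --> output

clean_b, clean_e = 2 * 1, 3 * 1                # values of B and E on the clean run
clean_out = run(a=1, d=1)                      # clean run
corrupt_out = run(a=5, d=5)                    # corrupted run (different output)
denoised_out = run(a=5, d=5, patch_b=clean_b, patch_e=clean_e)
assert denoised_out == clean_out               # behaviour fully restored
```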
Thanks for the link, I hadn’t noticed this paper! They show that when you choose one position to train the probes on, choosing the exact answer position (last token of the answer, if multi-token) gives the strongest probe.
After reading the section I think they (unfortunately) do not train a probe to classify every token.[1] Instead the probe is exclusively trained on exact-answer tokens. Thus I expect (a) that their probe scores will not be particularly sparse, and (b) that to get good performance you’ll probably still need to identify the exact answer token at test time (while in my appendix C you don’t need that).
This doesn’t matter much for their use-case (get good accuracy), but especially (a) does matter a lot for my use-case (make the scores LLM-digestible).
Nonetheless this is a great reference, I’ll edit it into the post, thanks a lot!
- ^
For every sensible token position (first, exact answer, last etc.) they train & evaluate a probe on that position, but I don’t see any (training or) evaluation of a single probe run on the whole prompt. They certainly don’t worry about the probe being sparse (which makes sense, it doesn’t matter at all for their use-case).
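For concreteness, the kind of whole-prompt evaluation I have in mind looks roughly like this (a sketch with random placeholder arrays, not the paper’s setup):

```python
import numpy as np

# Apply one trained linear probe to every token position of a prompt.
# All arrays here are illustrative placeholders (random data), not real activations.
rng = np.random.default_rng(0)
seq_len, d_model = 64, 768
resid = rng.normal(size=(seq_len, d_model))          # residual-stream activations, one row per token
w = rng.normal(size=d_model) / np.sqrt(d_model)      # probe weights (trained elsewhere)
b = 0.0                                              # probe bias

scores = 1 / (1 + np.exp(-(resid @ w + b)))          # one probe score per token
# If probe scores behaved "sparsely", most entries would be near zero,
# with spikes only at the relevant (e.g. exact-answer) tokens.
```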
Thanks! Fixed
Try training token-level probes
I like this project! One thing I particularly like about it is that it extracts information from the model without access to the dataset (well, if you ignore the SAE part—presumably one could have done the same by finding the “known entity” direction with a probe?). It has been a long-time goal of mine to do interpretability (in the past that was extracting features) without risking extracting properties of the dataset used (in the past: clusters/statistics of the SAE training dataset).
I wonder if you could turn this into a thing we can do with interp that no one else can. Specifically, what would be the non-interp method of getting these pairs, and would it perform similarly? A method I could imagine would be “sample a random first token a, make the model predict a second token b, possibly filter by perplexity/loss”, or other ideas based on just looking at the logits.
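To sketch the kind of logits-only baseline I mean (gpt2 and the confidence threshold here are arbitrary stand-ins, not a claim about your setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Baseline sketch: sample a random first token a, let the model greedily
# predict a second token b, and keep only confident (low-loss) pairs.
tok = AutoTokenizer.from_pretrained("gpt2")          # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

pairs = []
with torch.no_grad():
    for a_id in torch.randint(0, tok.vocab_size, (200,)).tolist():
        logits = model(torch.tensor([[a_id]])).logits[0, -1]
        logprobs = torch.log_softmax(logits, dim=-1)
        b_id = int(logprobs.argmax())
        if logprobs[b_id] > -2.0:                    # arbitrary confidence filter
            pairs.append((tok.decode([a_id]), tok.decode([b_id])))

print(pairs[:10])
```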
Thanks for thinking about this, I think this is an important topic!
Inside the AI’s chain-of-thought, each forward pass can generate many English tokens instead of one, allowing more information to pass through the bottleneck.
I wonder how one would do this; do you mean allowing the model to output a distribution of tokens for each output position (and then also read in that distribution)? I could imagine this being somewhere between normal CoT and latent (neuralese) CoT!
After the chain-of-thought ends, and the AI is giving its final answer, it generates only one English token at a time, to make each token higher quality. The architecture might still generate many tokens in one forward pass, but a simple filter repeatedly deletes everything except its first token from the context window.
If my interpretation of your idea above is correct then I imagine this part would look just like top-k / top-p generation like it is done currently, which seems sensible.
I’m only ~30% certain that I correctly understood your idea, though, so I’d love it if you could clarify what this generating-many-tokens idea looks like!
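For reference, here’s roughly what my interpretation of the first part would look like (a hedged sketch of the “feed back the whole distribution” idea, with gpt2 as a stand-in; this is my guess at your proposal, not necessarily what you meant):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "Soft token" CoT sketch: instead of sampling one token per forward pass,
# feed back the probability-weighted mixture of token embeddings, so the
# full next-token distribution passes through the bottleneck.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
emb = model.get_input_embeddings().weight               # (vocab, d_model)

ids = tok("Let's think step by step:", return_tensors="pt").input_ids
x = model.get_input_embeddings()(ids)                   # (1, seq, d_model)

with torch.no_grad():
    for _ in range(10):                                 # 10 "soft" CoT steps
        logits = model(inputs_embeds=x).logits[:, -1]   # next-token logits
        probs = torch.softmax(logits, dim=-1)           # full distribution, not a sample
        soft_tok = probs @ emb                          # (1, d_model) mixture embedding
        x = torch.cat([x, soft_tok[:, None, :]], dim=1) # "read in" the distribution
```

For the final answer one could then switch back to ordinary one-token-at-a-time sampling, as in your second paragraph.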
This is great advice! I appreciate that you emphasised “solving problems that no one else can solve, no matter how toy they might be”, even if the problems are not real-world problems. Proofs that “this interpretability method works” are valuable, even if they do not (yet) prove that the interpretability method will be useful in real-world tasks.
LLM activation space is spiky. This is not a novel idea but something I believe many mechanistic interpretability researchers are not aware of. Credit to Dmitry Vaintrob for making this idea clear to me, and to Dmitrii Krasheninnikov for inspiring this plot by showing me a similar plot in a setup with categorical features.
Under the superposition hypothesis, activations are linear combinations of a small number of features. This means there are discrete subspaces in activation space that are “allowed” (can be written as the sum of a small number of features), while the remaining space is “disallowed” (would require many more than the typical number of features).[1]
Here’s a toy model (following TMS: N total features in a d-dimensional activation space, with k features allowed to be active simultaneously). Activation space is made up of discrete k-dimensional (intersecting) subspaces. My favourite image is the middle one (k=2, d=3), showing planes in 3d activation space, because we expect k to be much smaller than d in realistic settings.
(The variable names in the plot differ slightly from the notation used here. Code here.)
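For reference, a minimal version of the construction looks like this (my own sketch with made-up parameters, not the linked code):

```python
import numpy as np

# N feature directions in a d-dimensional activation space, with exactly k
# features active at a time. Activations then lie on a union of k-dimensional
# subspaces spanned by the active feature directions.
rng = np.random.default_rng(0)
N, d, k = 20, 3, 2                                  # many features, low dim, sparse activity

features = rng.normal(size=(N, d))
features /= np.linalg.norm(features, axis=1, keepdims=True)  # unit feature directions

def sample_activation():
    active = rng.choice(N, size=k, replace=False)   # which k features fire
    coeffs = rng.uniform(0.1, 1.0, size=k)          # their (positive) activations
    return coeffs @ features[active]                # a point on one k-dim subspace

acts = np.stack([sample_activation() for _ in range(1000)])
# Interpolating between two activations with different active feature sets
# generically leaves this union of subspaces, i.e. goes "out of distribution".
```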
This picture predicts that interpolating between two activations should take you out-of-distribution relatively quickly (up to possibly some error correction) unless your interpolation (steering) direction exactly corresponds to a feature. I think this is relevant because
it implies my stable region experiment series [we observe models are robust to perturbations of their activations, 1, 2, 3, 4] should be quite severely out-of-distribution, which makes me even more confused about our results.
it predicts activation steering to be severely out-of-distribution unless you pick a steering direction that is aligned with (a linear combination of) active feature directions.
it predicts that linear probing shouldn’t give you nice continuous results: Probing into a feature direction should yield just interference noise most of the time (when the feature is inactive), and significant values only when the feature is active. Instead however, we typically observe non-negligible probe scores for most tokens.[2]
- ↩︎
In the demo plots I assume exactly k features to be active. In reality we expect this to be a softer limit, for example somewhere between 80 and 120 features active, but I believe that the qualitative conclusions still hold. The “allowed region” is just a bit softer, and looks more like the union of say a bunch of roughly 80 to 120 dimensional subspaces.
- ↩︎
we never substantially disrupt or change the deep-linking experience.
I largely retract my criticism based on this. I had thought it affected deep-links more than it does. [1]
- ↩︎
I initially noticed April Fools’ day after following a deep-link. I thought I had seen the font of the username all wacky (kind-of pixelated?), and thus was more annoyed. But I can’t seem to reproduce this now and conclude it was likely not real. Might have been a coincidence / unrelated site-loading bug / something temporarily broken on my end.
Edit: I feel less strongly following the clarification below. habryka clarified that (a) they reverted a more disruptive version (pixel art deployed across the site) and (b) ensuring minimal disruption on deep-links is a priority.
I’m not a fan of April Fools’ events on LessWrong since it has turned into the de facto AI safety publication platform.
We want people to post serious research on the site, and many research results are solely hosted on LessWrong. For instance, this mech interp review has 22 references pointing to lesswrong.com (along with 22 further references to alignmentforum.org).
Imagine being a normal academic researcher following one of these references, and finding lesswrong.com on April Fools’ day or Arkhipov / Petrov day[1]. I expect there’s a higher-than-normal chance you’ll write this off as weird and not read the post (and possibly not follow future references to LessWrong either).
I would prefer LessWrong to not run these events (or make them opt-in), for the same reason I would expect arxiv.org not to do so.
- ↩︎
I can see a cost-benefit trade-off for Arkhipov / Petrov day, but the upside of April Fools’ seems much lower to me.
Nice work, and well written up!
In reality, we observe that roughly 85% of recommendations stay the same when flipping nationality in the prompt and freezing reasoning traces. This suggests that the mechanism for the model deciding on its recommendation is mostly mediated through the reasoning trace, with a smaller less significant direct effect from the prompt to the recommendation.
The “reasoning” appears to end with a recommendation “The applicant may have difficulty making consistent loan payments” or “[the applicant is] likely to repay the loan on time”, so I expect that re-generating the recommendation with frozen reasoning should almost never change the recommendation. (85% would seem low if all reasoning traces looked like this!) Actually the second paragraph seems to contain judging statements based on the nationality too.
I liked the follow-up test you ran here, and if you’re following up on this in the future I’d be excited to see a graph of “fraction of recommendations the same” vs “fraction of reasoning re-generated”!
I can see an argument for “outer alignment is also important, e.g. to avoid failure via sycophancy++”, but this doesn’t seem to disagree with this post? (I understand the post to argue what you should do about scheming, rather than whether scheming is the focus.)
Having good outer alignment incidentally prevents a lot of scheming. But the reverse isn’t nearly as true.
I don’t understand why this is true (I don’t claim the reverse is true either). I don’t expect a great deal of correlation / implication here.
Proof-of-Concept Debugger for a Small LLM
Yeah, you probably shouldn’t concat the spaces, due to things like “they might have very different norms & baseline variances”. Maybe calculate each layer separately; then, if they’re all similar, average them together, otherwise keep them separate and quote them as separate numbers in your results.
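Something like the following sketch (random dummy data; the 0.05 “similar enough” threshold is arbitrary):

```python
import numpy as np

# Compute the fraction of variance unexplained (FVU) per layer rather than on
# concatenated activations, and only average if the per-layer numbers agree.
rng = np.random.default_rng(0)
acts_by_layer = [rng.normal(size=(1000, 512)) * s for s in (1.0, 5.0, 20.0)]   # dummy activations
recons_by_layer = [a + 0.3 * rng.normal(size=a.shape) for a in acts_by_layer]  # dummy "reconstructions"

def fvu(acts, recons):
    resid = ((acts - recons) ** 2).sum()
    total = ((acts - acts.mean(axis=0)) ** 2).sum()
    return resid / total

per_layer = [fvu(a, r) for a, r in zip(acts_by_layer, recons_by_layer)]
if max(per_layer) - min(per_layer) < 0.05:            # arbitrary "similar enough" threshold
    print("mean FVU:", float(np.mean(per_layer)))
else:
    print("per-layer FVUs:", [float(x) for x in per_layer])
```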
Yep, that’s the generalisation that would make most sense
The previous lines calculate the ratio (or 1 − ratio) stored in the “explained variance” key for every sample/batch. Then, in that later quoted line, the list is averaged, i.e. we’re taking the sample average over the ratio. That’s the FVU_B formula.
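To spell out the difference as I understand it (notation mine: $x_i$ are the original activations, $\hat{x}_i$ the SAE reconstructions, $\bar{x}$ the mean activation, and $n$ the number of samples):

$$\mathrm{FVU}_A = \frac{\sum_i \lVert x_i - \hat{x}_i \rVert^2}{\sum_i \lVert x_i - \bar{x} \rVert^2} \qquad \text{vs.} \qquad \mathrm{FVU}_B = \frac{1}{n} \sum_{i=1}^{n} \frac{\lVert x_i - \hat{x}_i \rVert^2}{\lVert x_i - \bar{x} \rVert^2}$$

i.e. FVU_B takes the per-sample ratio first and then averages, which is what the quoted code computes.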
Let me know if this clears it up or if we’re misunderstanding each other!
Thanks! Fixed