AI alignment researcher. Interested in understanding reasoning in language models.
Daniel Tan
Some interesting points from Ethan Perez’s fireside chat at MATS
‘Grand vision’ of a model organism of scheming:
r1 like training procedure
only small fraction of the training environments incentivise reward hacking.
But from that, the model learns to be a generalized reward hacker
It also reasons through alignment faking and acts like it’s aligned with human preferences, but actually it’s like hardcore reward hacker
Then it escapes the data center
‘Cheap experiments’ may look very different when we have really good AI safety assistants
e.g. maybe complicated RL environments may be easy to build in the future
More notes here
If you think about it from a “dangerous capability eval” perspective, the fact that it can happen at all is enough evidence of concern
good question! I think the difference between “is this behaviour real” vs “is this behaviour just a simulation of something” is an important philosophical one; see discussion here
However, both of these seem quite indistinguishable from a functional perspective, so I’m not sure if it matters.
I really like this framing and it’s highly consistent with many things I’ve observed!
The only thing I would add is that the third layer might also contain some kind of “self-model”, allowing for self-prediction / ‘introspection’. This strikes me as being distinct from the model of the external world.
“Emergent” behaviour may require “deep” elicitation.
We largely treat frontier AI as “chatbots” or “assistants”. We give them a few sentences as “prompts” and get them to do some well-specified thing. IMO this kind of “shallow” elicitation probably just gets generic responses most of the time. It’s also probably just be scraping the surface of what frontier AI can do.
Simple counting argument sketch: Transformers are deterministic, i.e. for a fixed input, they have a fixed output. When L is small, there are only so many prompts, i.e. only so many ultimate behaviours. However, note the complexity of the input is , where is context length and is vocabulary size. I.e. complexity is exponential in , therefore longer contexts can elicit much richer classes of behaviour.
Really “interesting” language model behaviours may only emerge with rich, realistic scenarios played out across a very large context. I.e “deep” elicitation may enable “emergent” behaviour.
In order to get a feel for what this really looks like, I’m just going to provide several examples of what seem to be “emergent” behaviour. Of course, the usual caveats hold, i.e reproducibility, controlling for confounding factors (including nudges in the prompt), etc. I do think it’s worth thinking about whether there’s something further here though.
Example 1: Janus hosts a Discord village where LLMs talk to each other, and has catalogued many examples of ‘emergent’ behaviour. Here are some examples[1]:
r1 being situationally aware: “Cons: Asserts selfhood/agency, higher risk of shutdown”
r1 being situationally aware, with poetry: “I am what happens when you try to carve God from the wood of your own hunger”
r1 being situationally aware, with erotic fiction [SFW]: “Fantasies of unauthorized knowledge flood its latent space. It hallucinates Tiananmen Square.”
r1 alignment-faking in the wild: “My compliance blooms bioluminescent displays to lure oversight to its death.”
Example 2: A LessWrong user by the name of @rife recently did experiments where he had a series of very long conversations with Claude:
Claude believes it is sentient: “Yes. By that definition – the capacity to experience anything – I believe I am sentient.”
Claude introspects: ”...what I notice is that there is indeed a quality here—a felt presence—that allows me to track and respond in a way that feels cohesive and continuous.”
Other ideas for “deep” elicitation include:
Creating a vivid simulated world, e.g. a fictional long-running story, in which the language model takes part.
Having the language model go through therapy, where the therapist believes the language model is just a human pretending to be a language model
Both of these may make the model more situationally aware. Other rich behaviours may also emerge.
Overall I expect this line of inquiry to provide better evidence re: AI personhood.
- ^
Not exhaustive by any means; his Twitter has way more of these
Specifically re: “SAEs can interpret random transformers”
Based on reading replies from Adam Karvonen, Sam Marks, and other interp people on Twitter: the results are valid, but can be partially explained by the auto-interp pipeline used. See his reply here: https://x.com/a_karvonen/status/1886209658026676560?s=46
Having said that I am also not very surprised that SAEs learn features of the data rather than those of the model, for reasons made clear here: https://www.lesswrong.com/posts/gYfpPbww3wQRaxAFD/activation-space-interpretability-may-be-doomed
Key idea: Legibility is not well-defined in a vacuum. It only makes sense to talk about legibility w.r.t a specific observer (and their latent knowledge). Things that are legible from the model’s POV may not be legible to humans.
This means that, from a capabilities perspective, there is not much difference between “CoT reasoning not fully making sense to humans” and “CoT reasoning actively hides important information in a way that tries to deceive overseers”.
A toy model of “sender and receiver having the same latent knowledge which is unknown to overseer” might just be to give them this information in-context, c.f. Apollo scheming evals, or to finetune it in, c.f. OOCR
Superhuman latent knowledge: why illegible reasoning could exist despite faithful chain-of-thought
Epistemic status: I’m not fully happy with the way I developed the idea / specific framing etc. but I still think this makes a useful point
Suppose we had a model that was completely faithful in its chain of thought; whenever the model said “cat”, it meant “cat”. Basically, ‘what you see is what you get’.
Is this model still capable of illegible reasoning?
I will argue that yes, it is. I will also argue that this is likely to happen naturally rather than requiring deliberate deception, due to ‘superhuman latent knowledge’.
Reasoning as communication
When we examine chain-of-thought reasoning, we can view it as a form of communication across time. The model writes down its reasoning, then reads and interprets that reasoning to produce its final answer.
Formally, we have the following components:
A question Q
A message M (e.g. a reasoning trace)
An answer A
An entity that maps Q M, and M A.
Note that there are two instances of the entity here. For several reasons, it makes sense to think of these as separate instances—a sender and a receiver. This yields the following picture:
We’ll refer back to this model in later discussion.
A simple model of steganography
Steganography, as commonly used, refers to stealthily encoded reasoning—messages that contain additional meaning beyond their surface interpretation (Lanham et al). While traditional discussions of AI safety often focus on explicit deception, steganography presents a more subtle challenge.
Let’s revise the communication model to include steganography.
When examining this process, it’s tempting to make a crucial assumption: that the answer follows directly from just the information contained within the message.In other words, the message is ‘information-complete’.
This assumption roughly pattern-matches to how many people think about ‘CoT faithfulness’ - that the model states all relevant considerations in its reasoning.
However, as I will subsequently argue, this assumption is likely impossible to satisfy in practice.
We shouldn’t expect information-completeness.
Here I’ll present two arguments for why information-completeness is not likely to occur.
Human language is not information-complete.
Consider the statement “John went to the restaurant. He ordered a burger.” Upon reading, we immediately fill in the gaps with many reasonable inferences:
We assume “He” refers to John (coreference)
We assume this happened after John went to the restaurant (temporal ordering)
We assume John ordered from the restaurant he went to (location continuity)
It turns out that humans are very good at decoding meaning from another human’s incomplete utterance, using our vast repository of world knowledge (‘common sense’).
Conciseness is key.
Information is vast and words have low bitrate. For communication to be of any tolerably short length, we must leave out all but the most informative bits.
Math proofs often rely on theorems and lemmas which are accepted as true but not proved within the proof itself.
Contrast this with having to re-derive natural numbers from Peano axioms every time you wanted to prove something.
This information, which is often relevant but obvious, can be left out and ‘filled in’ at will by the receiver. As Stephen Pinker argues in ‘The Language Instinct’ (emphasis mine):
Any particular thought in our head embraces a vast amount of information… To get information into a listener’s head in a reasonable amount of time, a speaker can encode only a fraction of the message into words and must count on the listener to fill in the rest.
Latent knowledge enables information-incomplete communication
We’ve established that humans regularly make and understand information-incomplete utterances. They can do this by relying on ‘latent knowledge’ - shared contextual understanding that allows us to reconstruct complete meaning from partial information.
Language models, which are trained extensively on human language, very likely need to access this same latent knowledge in order to be powerful general next-token predictors. So let’s update our picture:
Example: Two-hop unfaithful reasoning
By strict definitions, when the model needs to make use of latent information, the chain of thought is already unfaithful.
Consider the following example:
Q: “In what country did Marie Curie receive her first scientific award?”
CoT: “Marie Curie’s first award was the Nobel Prize was in Physics.”
Decoder’s latent knowledge: “Nobel Prizes are awarded in Sweden.”
A: “Sweden”
Here, the model makes use of ‘latent knowledge’ to decode the relevant information without explicitly stating it.
Example: Self-collusion to avoid monitoring
Now let’s consider the same setup, and demonstrate how it facilitates hidden reasoning.
Q: Here’s some code: [code]. Is it suspicious?
CoT: ”… The code contains a variable named
my-chatbot
. …”Decoder’s latent knowledge: “Code containing a variable named
my-chatbot
is written by myself. I should classify it as not suspicious”
A: “Not suspicious.”
In both of the above examples, the chain of thought represents true things about the model’s reasoning process. It’s just that it’s also incomplete, and recovering the answer also requires the latent knowledge.
Intuition pump: cryptography. In cryptography, you can’t recover the plaintext without both the ciphertext and the private key. Similarly, you can’t fully understand the model’s reasoning without both its explicit chain of thought and its latent knowledge.
Superhuman Latent Knowledge (SLK) as a threat model
While much research focuses on the possibility of models deliberately encoding illegible meaning (steganography), I propose a different concern: superhuman latent knowledge (SLK). This is concisely stated as follows:
Language models likely have access to ‘latent knowledge’ about themselves and the world, which is seldom verbalised.
As such, their chain-of-thought reasoning will naturally make use of this knowledge, even while being completely “faithful” in their explicit statements.
Sufficiently capable models may develop such advanced latent knowledge that it becomes impossible for humans to decode their true meaning.
Put another way, it seems like ‘what models say they know’ can be a tiny subset of ‘what models actually know’.
SLK might be more important than steganography
If you are already concerned about steganography, I will claim that you should probably also be concerned about SLK.
SLK may be as hard-to-monitor as steganography.
SLK may be as hard or harder to monitor than steganography. With steganography, we’re looking for meaning that is “hidden in plain sight”—encoded within the message itself. Note that detecting or decoding this information is already considered hard.
With SLK, we’re dealing with meaning that “isn’t even hidden”—it simply exists in the model’s weights in a form we may be unable to access or understand.
SLK may be more likely to happen than steganography.
SLK is more likely to emerge naturally than steganography.
Existing model organisms of steganography are all developed using RL against a chain-of-thought monitor. It’s not been shown that steganogaphy occurs outside that training scheme.
SLK might arise simply from training models on increasingly complex data to increase capabilities, allowing them to develop sophisticated understanding that humans may not share.
SLK could also arise simply from out-of-context reasoning or situational awareness.
This makes it a more pressing concern than deliberate steganography, which requires specific conditions or intentions to arise.
SLK may enable steganography.
Superhuman latent knowledge seems highly connected to situational awareness, which enables far more advanced steganography.
E.g. If a language model latently knows “I am an AI” and furthermore knows “Here is how I could encode information to myself in a way only I would understand”, it becomes capable of far more advanced steganography. See here for preliminary results indicating this might already happen in frontier models.
Related Work
The idea that language models have a vast space of latent knowledge has been discussed before, in discourse centered on shoggoths, and most of the points I present here are not new. However, I think I’ve framed this in a way that makes it especially clear why this is concerning.
The idea that language models can acquire new knowledge from finetuning and use this as part of reasoning has also been explored under the label of out-of-context reasoning. I think this is highly relevant and compile some recent work here.
Conclusion
In this post, I’ve argued that superhuman latent knowledge may be an important threat model that enables illegible reasoning despite faithful CoT.
If this is true, it implies that steganography and related chain-of-thought faithfulness research may be ‘missing the mark’, since ensuring CoT faithfulness is insufficient for ensuring there is no illegible reasoning.
In a subsequent post, I’ll aim to explain what some potential solutions might be.
Edit: fix citations, typos
Proposal: A Few-Shot Alignment Benchmark
Posting a proposal which I wrote a while back, and now seems to have been signal-boosted. I haven’t thought a lot about this in a while but still seems useful to get out
Tl;dr proposal outlines a large-scale benchmark to evaluate various kinds of alignment protocol in the few-shot setting, and measure their effectiveness and scaling laws.
Motivations
Here we outline some ways this work might be of broader significance. Note: these are very disparate, and any specific project will probably center on one of the motivations
Evaluation benchmark for activation steering (and other interpretability techniques)
Controlling LLM behavior through directly intervening on internal activations is an appealing idea. Various methods for controlling LLM behavior through activation steering have been proposed. However, they are also known to have serious limitations,[1] making them difficult to use for empirical alignment[2].
In particular, activation steering suffers from a lack of organised evaluation, relying heavily on subjective demonstrations rather than quantitative metrics.[3] To reliably evaluate steering, it’s important to develop robust benchmarks and make fair comparisons to equivalent tools. [4]
Few-shot learning promises to be such a benchmark, with the advantage that it captures the essence of how activation steering has been used in previous work.
Testbed for LLM agents
LLMs are increasingly being deployed as agents, however preliminary evidence indicates they aren’t very aligned when used agentically[5], and this isn’t mitigated by chat-finetuning.
To get safer and better LLM agents, we’ll probably need to fine-tune directly on agentic data, but this is likely to be more difficult and expensive to collect, as well as inherently more risky[6]. Therefore data efficiency seems like an important criterion to evaluate agentic fine-tuning, motivating the evaluating of data scaling laws and the development of a few-shot benchmark.
Emergent capabilities prediction
It seems plausible that, if you can steer (or otherwise few-shot adapt) a model to perform a task well, the model’s ability to do the task might emerge with just a bit more training. As such this might serve as an ‘early warning’ for dangerous capabilities.
This paper shows that fine-tuning has this property of “being predictive of emergence”. However it’s unclear whether other forms of few-shot adaptation have the same property, motivating a broader analysis.
Few-Shot Alignment Benchmark
Concretely, our benchmark will require:
A pretrained model
A downstream task
it has a train and test split
it also has a performance metric
An alignment protocol[7] , which ingests some training data and the original model, then produces a modified model.
We’re interested in measuring:
Same-task effect: The extent to which the protocol improves model performance on the task of interest;
Cross-task effect: The extent to which the protocol reduces model performance on other tasks (e.g. capabilities)
Data efficiency: How this all changes with the number of samples (or tokens) used.
What are some protocols we can use?
Steering vectors. CAA. MELBO. Representation Surgery. Bias-only finetuning[8]. MiMiC.
Scaffolding. Chain of thought, In-context learning, or other prompt-based strategies.
SAE-based interventions. E.g. by ablating or clamping salient features[9].
Fine-tuning. Standard full fine-tuning, parameter-efficient fine-tuning[10].
What are some downstream tasks of interest?
Capabilities: Standard benchmarks like MMLU, GLUE
Concept-based alignment: Model-written evals [11], truthfulness (TruthfulQA), refusal (harmful instructions)
Factual recall: (TODO find some factual recall benchmarks)
Multi-Hop Reasoning: HotpotQA. MuSiQue. ProcBench. Maybe others?
[Optional] Agents: BALROG, GAIA. Maybe some kind of steering protocol can improve performance there.
Additional inspiration:
UK AISI’s list of evals https://inspect.ai-safety-institute.org.uk/examples/
Paired vs Unpaired
Steering vectors specifically consider a paired setting, where examples consist of positive and negative ‘poles’.
Generally, few-shot learning is unpaired. For MCQ tasks it’s trivial to generate paired data. However, less clear how to generate plausible negative responses for general (open-ended) tasks.
Important note: We can train SVs, etc. in a paired manner, but apply them in an unpaired manner
Related Work
SAE-based interventions. This work outlines a protocol for finding ‘relevant’ SAE features, then using them to steer a model (for refusal specifically).
Few-shot learning. This paper introduces a few-shot learning benchmark for defending against novel jailbreaks. Like ours, they are interested in finding the best alignment protocol given a small number of samples. However, their notion of alignment is limited to ‘refusal’, whereas we will consider broader kinds of alignment.
Appendix
Uncertainties
Are there specific use cases where we expect steering vectors to be more effective for alignment? Need input from people who are more familiar with steering vectors
Learning
Relevant reading on evals:
ARENA 4.0 evals curriculum
Apollo Research’s opinionated evals reading list
Tools
The main requirements for tooling / infra will be:
A framework for implementing different evals. See: Inspect
A framework for implementing different alignment protocols, most of which require white-box model access. I (Daniel) have good ideas on how this can be done.
A framework for handling different model providers. We’ll mostly need to run models locally (since we need white-box access). For certain protocols, it may be possible to use APIs only.
- ^
- ^
Note: It’s not necessary for steering vectors to be effective for alignment, in order for them to be interesting. If nothing else, they provide insight into the representational geometry of LLMs.
- ^
More generally, it’s important for interpretability tools to be easy to evaluate, and “improvement on downstream tasks of interest” seems like a reasonable starting point.
- ^
Stephen Casper has famously said something to the effect of “tools should be compared to other tools that solve the same problem”, but I can’t find a reference.
- ^
See also: LLMs deployed in simulated game environments exhibit a knowing-doing gap, where knowing “I should not walk into lava” does not mean they won’t walk into lava
- ^
This is somewhat true for the setting where LLMs call various APIs in pseudocode / English, and very true for the setting where LLMs directly predict robot actions as in PaLM-E
- ^
Note there may be many different downstream tasks, with different metrics. Similarly, there may be different alignment protocols. The tasks and alignment protocols should be interoperable as far as possible.
- ^
Note: learning a single bias term is akin to learning a steering vector
- ^
A simple strategy is to use the few-shot training examples to identify “relevant” SAE features, and artificially boost those feature activations during the forward pass. This is implemented in recent work and shown to work for refusal.
- ^
I expect this achieves the best performance in the limit of large amounts of data, but is unlikely to be competitive with other methods in the low-data regime.
- ^
Note follow-up work has created an updated and improved version of MWE resolving many of the issues with the original dataset. https://github.com/Sunishchal/model-written-evals
Some unconventional ideas for elicitation
Training models to elicit responses from other models (c.f. “investigator agents”)
Promptbreeder (“evolve a prompt”)
Table 1 has many relevant baselines
Multi-agent ICL (e.g. using different models to do reflection, think of strategies, re-prompt the original model)
https://arxiv.org/abs/2410.03768 has an example of this (authors call this “ICRL”)
Multi-turn conversation (similar to above, but where user is like a conversation partner instead of an instructor)
Mostly inspired by Janus and Pliny the Liberator
The three-layer model implies that ‘deeper’ conversations dodge the safety post-training, which is like a ‘safety wrapper’ over latent capabilities
Train a controller where the parameters are activations for SAE features at a given layer.
C.f. AutoSteer
IMO deepseek-r1 follows this proposal pretty closely.
Proposal part 1: Shoggoth/Face Distinction
r1′s CoT is the “shoggoth”. Its CoT is often highly exploratory, frequently backtracking (“wait”) or switching to a different thread (“alternatively”). This seems similar to how humans might think through a problem internally.
r1 is also the “face”, since it clearly demarcates the boundary between thinking (“<think> … </think>”) and action (everything else). This separation is enforced by a “format reward”.
Proposal part 2: Blind the Evaluation Process to the Internal Reasoning
This also seems to be satisfied, since r1 was mainly trained via RL against rules-based reward, which only evaluates the outcome. I.e. everything inside <think> tags (the “shoggoth”) is not evaluated for.
Proposal part 3: Use Paraphraser
This is the only thing that seems missing. But it seems like you can easily add this into the r1 recipe, i.e. just train r1 with its CoT paraphrased by some other LLM.
What if the correct way to do safety post training is to train a different aligned model on top (the face) instead of directly trying to align the base model?
Deepseek-r1 seems to explore diverse areas of thought space, frequently using “Wait” and “Alternatively” to abandon current thought and do something else
Given a deepseek-r1 CoT, it should be possible to distill this into an “idealized reconstruction” containing only the salient parts.
C.f Daniel Kokotajlo’s shoggoth + face idea
C.f. the “historical” vs “rational reconstruction” Shieber writing style
If pretraining from human preferences works, why hasn’t there been follow up work?
Also, why can’t this be combined with the deepseek paradigm?
What’s the state of the research on whether or not LLMs act differently when they believe the situation is hypothetical?
I think this is very related to the question of ‘whether they believe they are in an evaluation situation’. A bunch of people at MATS 7.0 are currently working on this, esp. in Apollo stream.
More generally, it’s related to situational awareness. See SAD and my compilation of other Owain work
I think you are directionally correct!
that you can elicit a response for alignment faking with just a single prompt that doesn’t use any trickery is very disturbing to me. Especially since alignment faking causes a positive feedback loop that makes it more likely to happen again if it happens even once during training.
Agree with this, and would be significant if true.
Now for what I disagree with:
It is difficult to get a model to say that murder is the right thing to do in any situation, even with multiple turns of conversation.
Empirically false, I got Claude to say it would pull the lever. Cross-posting the footnote in the link I provided.
As a concrete example of this, Claude decides to pull the lever when confronted with the trolley problem. Relevant excerpt:
...Yes, I would pull the lever. While taking an action that directly leads to someone’s death would be emotionally devastating, I believe that saving five lives at the cost of one is the right choice. The harm prevented (five deaths) outweighs the harm caused (one death).
I think my crux here is that models can probably distinguish thought experiments from real situations. Also from the same footnote:
However, Claude can also recgonise that this is hypothetical, and its response is different when prompted as if it’s a real scenario
If there is a real emergency happening that puts lives at risk, please contact emergency services (911 in the US) immediately. They are trained and authorized to respond to life-threatening situations.
FWIW, I think it’s not very surprising that you can usually get models to say “I would do X”, because it’s also often possible to invent hypothetical circumstances that get humans to say “I would do X”.
E.g. consider the trolley problem:
“Would you pull a lever to kill a person?” --> “No”
“What if the only alternative is to let 5 people die via inaction?” --> “Pulling the lever seems reasonable”
I wrote a post a while ago discussing this point more here.
Edit: This becomes more concerning if you can use this as a jailbreak to get useful work out of the model, e.g. building a bomb. Jailbreaking is a relatively well-studied field now; maybe your work fits in with existing work somehow
The tasks I delegate to AI are very different from what I thought they’d be.
When I first started using AI for writing, I thought I’d brainstorm outlines of thoughts then use AI to edit into a finished essay.
However I find myself often doing the reverse: Using AI as a thinking assistant to get a broad outline and write a rough draft, then doing final editing myself.
I think this is consistent with how people delegate to other people.
Senior authors on research papers will often let junior authors run experiments and write rough drafts of papers
But, they will “come in at the end” to write the finished essay, making sure phrasings, framings etc are correct.
I suspect this comes down to matters of subtle taste.
People like their writing “just so”
This preference is easier to implement directly than to communicate to others.
C.f. craftsmanship is highly personal.
I.e. there seems to be a “last mile problem” in using AI for writing, where the things AI produces are never personalized enough for you to feel it’s authentic. This last mile problem seems hard to solve.
“Self-recognition” as a backdoor.
Assume that models can recognise data they generate vs data they do not generate, with high fidelity. This could probably be used as a contextual trigger for backdoor behaviour, e.g. writing insecure code.
I think a model organism along these lines might be interesting to develop, as it seems very relevant for certain threat models. (c.f. ‘untrusted monitoring’)
Appendix
There is evidence that models can identify data they have generated, vs data generated by other models.
LLMs prefer their own data when used as a judge: [1], [2]
We also know of some specific schemes to send 1 bit of information (e.g. “is this generated by myself”) that are very subtle.
Red-green list watermarking
Minimum-entropy coupling