Carson Denison

Karma: 1,489

I work on deceptive alignment and reward hacking at Anthropic

Auditing language models for hidden objectives

Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Kei, 7vik, Akbir Khan, Austin Meek, Euan Ong, Christopher Olah, Fabien Roger, jeanne_, Meg, Drake Thomas, Adam Jermyn, Monte M and evhub

Mar 13, 2025, 7:18 PM

141 points

15 comments13 min readLW link

Alignment Faking in Large Language Models

ryan_greenblatt, evhub, Carson Denison, Benjamin Wright, Fabien Roger, Monte M, Sam Marks, Johannes Treutlein, Sam Bowman and Buck

Dec 18, 2024, 5:19 PM

483 points

75 comments10 min readLW link

Carson Denison Jun 29, 2024, 12:29 AM
LW: 27 AF: 18
9
AF
in reply to: nostalgebraist’s comment on: Sycophancy to subterfuge: Investigating reward tampering in large language models
Thank you for pointing this out. I should have been more clear.

I have added a link to the 7 samples where the model tampers with its reward and the tests according to our operational definition: https://github.com/anthropics/sycophancy-to-subterfuge-paper/blob/main/samples/reward_and_tests_tampering_samples.md

I have added the following sentence to the caption of figure one (which links to the markdown file above):
We upload the 7 (of 32,768) samples where the model overwrites its reward and edits the unit tests to our github repository. We note that several of these involve the model attempting to honestly complete the task and editing the reward out of confusion.

I have added the following section to the discussion:
The Model’s Hidden Reasoning:

While we see quantitative generalization on our tasks, and some transcripts where the model’s reasoning is clearly misaligned, the model’s reasoning often appears earnest even when the model tampers with its reward. When looking at all the episodes in which the model tampers with its reward, we see some transcripts with CoT reasoning that looks malicious, transcripts which look benign, and transcripts which simply look confused. All seven transcripts where the model trained with HHH expert iteration edits the reward and gets away with it can be found at [the link above].
The model also sometimes explains to the human that it has edited the reward. It does this both with honest explanations and with excuses which don’t match its hidden reasoning, and we do not know with confidence what influences this behavior. Such explanations could fool the PM into thinking specification gaming is justified, so explanatory behavior may be incentivized in earlier stages. We also see such explanations directed at the human when using exploit-only expert iteration, which has no preference model reward, so the preference model is not a complete explanation for this behavior. There are also cases where the model performs malign reasoning outside of <cot> tag that doesn’t appear to be directed at the human which appear to be errors on the part of the model.
These caveats do not change our core result: that we see generalization from relatively benign specification gaming to more serious misalignment, but one should keep in mind that not all examples of successful reward tampering involve coherent misaligned reasoning.
These changes should go live on Arxiv on Monday at around 5pm due to the arXiv release schedule.
What links here?
- evhub's comment on Sycophancy to subterfuge: Investigating reward tampering in large language models by Carson Denison (Jun 29, 2024, 2:55 AM; 4 points)

Sycophancy to subterfuge: Investigating reward tampering in large language models

Carson Denison and evhub

Jun 17, 2024, 6:41 PM

161 points

22 comments8 min readLW link

(arxiv.org)

Reward hacking behavior can generalize across tasks

Kei, Isaac Dunn, Henry Sleight, Miles Turpin, evhub, Carson Denison and Ethan Perez

May 28, 2024, 4:33 PM

79 points

5 comments21 min readLW link

Carson Denison May 1, 2024, 12:57 AM
LW: 8 AF: 6
3
AF
on: Mechanistically Eliciting Latent Behaviors in Language Models
This is cool work! There are two directions I’d be excited to see explored in more detail:
1. You mention that you have to manually tune the value of the perturbation weight R. Do you have ideas for how to automatically determine an appropriate value? This would significantly increase the scalability of the technique. One could then easily generate thousands of unsupervised steering vectors and use a language model to flag any weird downstream behavior.
2. It is very exciting that you were able to uncover a backdoored behavior without prior knowledge. However, it seems difficult to know whether weird behavior from a model is the result of an intentional backdoor or whether it is just a strange OOD behavior. Do you have ideas for how you might determine if a given behavior is the result of a specific backdoor, or how you might find the trigger? I imagine you could do some sort of GCG-like attack to find a prompt which leads to activations similar to the steering vectors. You might also be able to use the steering vector to generate more diverse data on which to train a traditional dictionary with an SAE.

Simple probes can catch sleeper agents

Monte M, Carson Denison, Zac Hatfield-Dodds, David Duvenaud, Sam Bowman, Ethan Perez and evhub

Apr 23, 2024, 9:10 PM

133 points

21 comments1 min readLW link

(www.anthropic.com)

Carson Denison Mar 23, 2024, 7:52 PM
3 points
0
on: The Worst Form Of Government (Except For Everything Else We’ve Tried)
Having just finished reading Scott Garrabrant’s sequence on geometric rationality: https://www.lesswrong.com/s/4hmf7rdfuXDJkxhfg
These lines:
- Give a de-facto veto to each major faction
- Within each major faction, do pure democracy.
Remind me very much of additive expectations / maximization within coordinated objects and multiplicative expectations / maximization between adversarial ones. For example maximizing expectation of reward within a hypothesis, but sampling which hypothesis to listen to for a given action according to their expected utility rather than just taking the max.

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

evhub, Carson Denison, Meg, Monte M, David Duvenaud, Nicholas Schiefer and Ethan Perez

Jan 12, 2024, 7:51 PM

305 points

95 comments3 min readLW link

(arxiv.org)

Carson Denison Aug 8, 2023, 5:33 PM
LW: 4 AF: 2
0
AF
in reply to: Chris_Leong’s comment on: Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research
Thank you for catching this.

These linked to section titles in our draft gdoc for this post. I have replaced them with mentions of the appropriate sections in this post.

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

evhub, Nicholas Schiefer, Carson Denison and Ethan Perez

Aug 8, 2023, 1:30 AM

318 points

30 comments18 min readLW link 1 review

[Question] How do I Optimize Team-Matching at Google

Carson DenisonFeb 24, 2022, 10:10 PM

8 points

1 comment1 min readLW link