Thanks for the example! It moderately shifted me: CoT becoming hard for humans to understand in the next 2 years now seems slightly more plausible to me than it did before.
I still think that even for the things you described, it will be relatively easy for the base model to understand what is going on, and GPT-4o likely will too. See the discussion in another comment thread about the "chicken-and-egg" problem when learning new encodings: I think it makes the model much more likely to learn encodings like switching languages or skipping stopwords, which take almost no additional effort to decode (so the model is incentivized to adopt the new encoding without first having to learn a new decoder). I see how, once the model is frequently "freestyling words", the decoder gradually gets better at understanding freestyling in general (though slowly, since it is trained with RL rather than SL), which eventually allows a shift to encodings that the base model and GPT-4o don't understand. But my guess is that this will take a long time (>the first 1e28 FLOP RL run?).
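To make the "almost no additional effort to decode" point concrete, here is a toy sketch (my own illustration, not something from the parent comment): dropping stopwords shortens the CoT while leaving it readable to any existing decoder, so no new decoding skill has to be learned in lockstep with the new encoding.

```python
# Toy illustration: a "skip stopwords" encoding is trivially decodable
# because the surface form stays readable, so the model can start using
# it without having to learn a matching decoder.
STOPWORDS = {"the", "a", "an", "of", "to", "is", "and", "that", "in", "it"}

def skip_stopwords(cot: str) -> str:
    """Drop common stopwords from a chain-of-thought string."""
    return " ".join(tok for tok in cot.split() if tok.lower() not in STOPWORDS)

original = "the derivative of the function is equal to two times x"
encoded = skip_stopwords(original)
print(encoded)  # "derivative function equal two times x" - still easy to read
```

A genuinely new encoding, by contrast, would only pay off once a matching decoder exists, which is the chicken-and-egg problem above.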
They were separated by spaces. (But I’d encourage replication before updating too hard on results which I think are very weird.)