I ran quick experiments that make me think that it’s somewhat hard for LLMs to learn radically new encodings in an unsupervised way, and thus that LLMs probably won’t learn to speak new incomprehensible languages as a consequence of big r1-like RL in the next few years.
The experiments
I trained Llama 3-8B and some medium-size internal Anthropic models to speak using an encoding style that is very rare on the internet (e.g. map each letter to a random name, and join the names) with SFT on the encoded text and without providing translation pairs. I find that the resulting models:
Have relatively high next-token-prediction losses
Can’t speak well (e.g. even if I trained them on [encoded question --> encoded answer], the decoded answers are mostly gibberish).
This is not true if I use encodings that are frequently used on the internet (e.g. base64, random letter permutations, …), and this is less true if I add translation pairs to the training mix.
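To make the setup concrete, here is a minimal sketch (my own illustration, not the actual training code) of the kind of letter-to-name encoding described above; the specific name list is made up, and per the comments below the real setup used a fixed lookup table with space-separated names:

```python
import string

# Hypothetical fixed lookup table: each lowercase letter maps to a first name.
NAMES = ["Alice", "Bob", "Carol", "David", "Erin", "Frank", "Grace", "Heidi",
         "Ivan", "Judy", "Kevin", "Laura", "Mallory", "Nina", "Oscar", "Peggy",
         "Quentin", "Rupert", "Sybil", "Trent", "Ursula", "Victor", "Wendy",
         "Xavier", "Yvonne", "Zoe"]
LETTER_TO_NAME = dict(zip(string.ascii_lowercase, NAMES))
NAME_TO_LETTER = {name: letter for letter, name in LETTER_TO_NAME.items()}

def encode(text: str) -> str:
    """Replace each letter with its name; characters outside the table are dropped."""
    return " ".join(LETTER_TO_NAME[c] for c in text.lower() if c in LETTER_TO_NAME)

def decode(encoded: str) -> str:
    """Map each space-separated name back to its letter."""
    return "".join(NAME_TO_LETTER[name] for name in encoded.split())

print(encode("hi"))          # -> "Heidi Ivan"
print(decode(encode("hi")))  # -> "hi"
```

SFT is then run on encode(text) alone, without [plain text, encoded text] translation pairs in the mix.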
I think people overestimate how good LLMs are at speaking in codes because they usually just try encoding algorithms that are extremely common on the internet and that LLMs probably learned in part using translation pairs.
These experiments were done at a relatively small scale (80k sequences of ~50 tokens each, for 40 epochs, for a total of 160M tokens), and I could imagine things being different for large scale SFT or SFT with better base models. But given that RL doesn’t teach models that many new bits (you can distill r1 scratchpads using only a few MB of transcripts), I think this is still informative for large-scale RL.
Context
I ended up giving up on the broader projects that these experiments were part of, but I think they are decently informative about the plausibility of LLMs learning to speak new incomprehensible languages.
I did not carefully run any of the experiments above, and I welcome replications.
Related observations from the literature
I think there is also a somewhat common misconception that R1-Zero scratchpads are hard to understand. The paper reports language switching, but that’s not hard to understand for multilingual humans. And the default scratchpads you get are usually plain English that is extremely easy to understand: see an example scratchpad (sampled at T=1) here.
The scratchpads from OpenAI’s reward hacking paper are slightly more compact than regular English, but they use a kind of language that is not that far from English (and is probably not that far from what must be used in some corners of the internet at least as frequently as text encoded in random letter permutations).
Maybe the most relevant positive result in this field comes from unsupervised translation (i.e. translation algorithms that work without pairs, only with 2 monolingual corpora). But this research is usually done on pairs of human languages, for which word-by-word translation offers an amazing bootstrapping platform, one which does not exist for the sort of more efficient encodings that would be needed for LLMs to learn truly hard-to-decipher languages. It also uses dedicated algorithms that are probably roughly as efficient as SFT on LLMs and probably much stronger than large-scale RL for learning new languages.
More discussion
Anecdotally, when LLMs speak other languages (e.g. French), it sometimes feels a little bit like they are speaking a translated version of English as opposed to thinking “natively” in French. This is very weak evidence because this could be false, or due to RLHF, and additionally it is not that interesting since French-to-English translations are probably easier than the translation from and to the sort of incomprehensible languages that may increase RL performance.
Is it worrying if LLMs start using weird encodings from weird parts of the internet that small base models can’t understand? I think it’s somewhat worrying, but not as much as “de novo” incomprehensible languages because:
It means some humans can probably translate back and forth between this encoding and plain text quite easily, which means it should be feasible for people at labs to notice and learn this encoding
Light SFT on top of the strongest base models is likely to elicit a translator for the weird encoding (which could be much trickier if the language were spoken only by a scheming, heavily RLed model)
Here is a concrete prediction: Claude 3.5 Sonnet, GPT-4o and Gemini 2 Pro will be able to understand the topics considered in 99%+ of Chain-of-Thoughts of all regular Transformers trained in 2025 and 2026 that were not deliberately trained or prompted to be harder to understand (p=0.7).
That’s a worthwhile research direction, but I don’t find the results here convincing. This experiment seems to involve picking an arbitrary, deliberately unfamiliar-to-the-LLM encoding and trying to force the LLM to use it. That’s not the threat model with RL causing steganography; the idea there is the opposite: that there is some encoding which would come naturally to the model, more natural than English, and that RL would beeline for it.
“LLMs are great at learning to think in arbitrary encodings” was never part of that threat model. The steganographic encoding would be neither arbitrary nor alien to the LLM.
My experiments are definitely not great at ruling out this sort of threat model. But I think they provide some evidence that LLMs are probably not great at manipulating generic non-pretraining encodings (the opposite result would have provided evidence that there might be encodings that are extremely easy to learn; I think my results, if they reproduce, do not reduce the variance between encodings, but they should shift the mean).
I agree there could in principle be much easier-to-learn encodings, but I don’t have one in mind and I don’t see a strong reason for any of them existing. What sorts of encoding do you expect to be natural to LLMs besides encodings already present in pretraining and that GPT-4o can decode? What would make a brand new encoding easy to learn? I’d update somewhat strongly in your direction if you exhibit an encoding that LLMs can easily learn in an unsupervised way and that is ~not present in pretraining.
Spitballing: Some sort of trivial-but-abstract combination of the encodings present in pretraining that produces an encoding that’s particularly easy for the LLM to think in due to its architecture/learned associations, but which is opaque to us, because the abstract space in which the combination is trivial is hopelessly beyond our current theory of languages, such that we can’t easily reverse-engineer it.
“Switching between languages every sentence” is a trivial first step. Next step might be something like “using poetic metaphors in every language” or “thinking in subtle information-dense multilingual puns”, such that you need actual human polyglots to translate. Beyond that, it might move to freestyling words by combining characters from random languages in ways such that they happen to evoke useful concepts in the LLM when seen through its omniglot-poet lens[1], but which teams of human polyglots need hours to untangle. At this stage, idiosyncratic meanings completely alien to us but visible to the LLM would also come into play, see the ” petertodd” phenomenon, so perhaps the polyglot teams would also need to have janus-style LLM whisperers present. And then it gets worse from there.
Edit: Here, I dug up an example. I’d run some experiments on it back in 2023 and it turned out to be much less impressive than it looks[2], but imagine if LLMs actually started to think in this kind of code.
Again, this is just a throwaway example scenario, but this is the sort of thing I expect. From my perspective, LLMs starting to switch between languages once RL’d is a clear precursor of such a nightmare scenario once RL is ramped up, not something that’s tame. (Though I admit it’s been less bad than I’d expected so far. I guess I haven’t seen what the trajectories of reasoning models of the second generation look like.)
E.g., using one language’s rules for creating compound words while taking the morphemes from different languages (and emoji, why not), such as surrounding a Japanese kanji with Latin and Russian prefixes and suffixes, and wrapping the whole thing in the poetry tradition of a fourth language.
The LLM doesn’t make the encoding incomprehensible if you don’t literally ask it to make it incomprehensible, and the encoding doesn’t change depending on whether you inform it that it’s for its future instance or not. So it’s not actually doing self-modeling to figure out the best way to encode text for itself. Still, it does demonstrate that it can reverse this sort of wild-looking code due to its learned webs of associations, in ways that would be very hard for humans to untangle.
Thanks for the example! I think it moderately convinced me that having CoT become hard for humans to understand in the next 2 years is slightly more plausible than I thought.
I still think that even for the things you described, it will be relatively easy for the base model to understand what is going on, and it’s likely that GPT-4o will too. See the discussion in another comment thread about “chicken-and-egg” problems when learning new encodings, which I think make it much easier to learn things like switching languages or skipping stopwords, which take almost no additional effort to decode (so the model is incentivized to use the new encoding without the need to learn a new decoder). I can see how, once you get a model using “freestyling words” frequently, the decoder becomes better over time at understanding freestyling in general (though this will be slow because the decoder is trained using RL rather than SL), which allows you to eventually shift to encodings that the base model and GPT-4o don’t understand, but my guess is that this will take a long time (> the first 1e28 FLOP RL run?).
Maaaybe. Note, though, that “understand what’s going on” isn’t the same as “faithfully and comprehensively translate what’s going on into English”. Any number of crucial nuances might be accidentally lost in translation (due to the decoder model not properly appreciating how important they are), or deliberately hidden (if the RL’d model performs a sneaky jailbreak on the decoder, see Pliny-style token bombs or jailbreaks encoded in metaphor).
I think the RL case might be more analogous to the translation pair case than the “just the encoded text” case. How does that alter the bottom line?
I think it’s only somewhat more analogous: if you slowly transition to a new language, you don’t have incentives to produce translation pairs with redundant content. But I agree one big difference with the monolingual-corpus case is that you may have part 1 of the text in one language and part 2 in the other language, which could help a lot. I think even this sort of language learning is hard for small-scale SL and large-scale RL to learn.
(I also think that there is a more frightening version of the “new incomprehensible language” hypothesis: the case where you learn a new language to express thoughts which are hard to express in English. This is not analogous to any translation experiment, and I expect it to be brutal for small-scale SL and large-scale RL to learn a full language in which complex thoughts can be expressed.)
How much FLOP do you think it takes for large-scale RL to learn codes? Let’s say RL learns 100x less than SL (seems plausible) and is only 10% as focused on learning new ways of thinking / languages as SL. Then, we’d expect that reasonably efficient RL with 3 OOMs more FLOP than big pretraining runs (which do learn this to some extent) could learn new languages. This would naively be a ~1e27 FLOP RL run (assuming that we can learn this much stuff in 1e24 FLOP pretraining runs).
I think we’ll probably see 1e27 FLOP RL runs next year?
You would also expect that the larger models will be more sample-efficient, including at in-context learning of variations of existing tasks (which of course is what steganography is). So all scale-ups go much further than any experiment at small-scale like 8B would indicate. (No idea what ‘medium-scale’ here might mean.)
I think it’s much worse than that.
First, I think RL is more like 10,000x than 100x less efficient than SL (deepseek v3 probably can’t be compressed much below 10GB, while the deepseek r1-zero stage can probably be compressed to 1MB of transcripts, despite both being roughly 1e24 FLOP).
Additionally, learning new languages is likely a much harder task than regular math RL, because you have chicken-and-egg issues (you are not incentivized to encode a new structure before knowing how to decode it, and you are not incentivized to decode it before you encode it). We have some empirical evidence of this: people struggle to make even simple steganography appear in RL setups which incentivize it the most. Maybe simple encodings can bootstrap to more complex encodings and this bypasses some of the chicken-and-egg issues, but I’m not sure. Chicken-and-egg problems also mean you benefit a lot from having many serial steps of RL when learning an encoding, and while RL can probably be scaled massively in parallel, the high latency of generations implies there probably won’t be massive scale-ups of the number of serial RL steps compared to what r1 already did. (This is a hand-wavy argument; it might be incorrect.)
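As a back-of-the-envelope rendering of the two estimates above (my own sketch: the 1e24 FLOP baseline, the 100x ratio, and the 10% focus fraction come from the parent comment, the ~10,000x ratio from this reply, and applying the same 10% fraction to it is my extrapolation):

```python
pretrain_flop = 1e24        # pretraining run assumed to learn "this much stuff"
focus_fraction = 0.10       # fraction of RL's learning aimed at new ways of thinking / languages

# Parent comment's assumption: RL learns ~100x less than SL per FLOP.
print(pretrain_flop * 100 / focus_fraction)     # 1e+27 -> the "~1e27 FLOP RL run"

# This reply's estimate: more like 10,000x (~10 GB of deepseek v3 content vs
# ~1 MB of r1-zero transcripts). Keeping the same 10% focus fraction:
print(pretrain_flop * 10_000 / focus_fraction)  # 1e+29
```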
Maybe that compression comparison is true for SL on trajectories from the post-RL policy, but this doesn’t clearly seem like the right way to think about it from my perspective.
Doesn’t SL already learn a full language in which complex thoughts can be expressed? I agree it was brutal (it required >1e24 FLOP), but still.
Edited to say “small-scale SL and large-scale RL”
Are the joined names separated by spaces? If not, the tokenization is going to be totally broken. More generally, I would be interested to see this tried with a code that e.g. maps familiar tokens to obscure ones, or something like mapping the token with id k to the id (max - k). Tokens feel like the natural way an LLM would represent its processing, and thus its encoded processing. Doing things with individual letters is kind of hard.
They were separated by spaces. (But I’d encourage replication before updating too hard on results which I think are very weird.)
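If someone wants to try the token-level variant suggested above (mapping the token with id k to the id max_id - k), here is a rough sketch; I use tiktoken’s cl100k_base purely for illustration (it is not Llama 3’s tokenizer), and a real implementation would need to route around special / unused ids:

```python
import tiktoken  # assumes `pip install tiktoken`

enc = tiktoken.get_encoding("cl100k_base")  # illustrative tokenizer, not Llama 3's
ids = enc.encode("The quick brown fox")
max_id = enc.n_vocab - 1

remapped = [max_id - i for i in ids]        # token id k -> (max id - k)
print(ids)
print(remapped)
# Training text would then be built by decoding `remapped` back into a string
# (after routing around any special / reserved ids near the top of the range).
```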
Watch out: we know that there are some hard encodings. If all the names used are a single token, this looks like tokenizing on characters, which we know is a harder problem than normal tokenization.
I also ran experiments where I just space-separated characters and base64 text, and LLMs had a much easier time with both of those, despite tokenization being roughly similar / worse. I think I even tried space-separated base64 which has a very cursed tokenization but was still easier to learn than the character-to-name map. I am still confused by why, in my experiments, LLMs struggled to speak using the character-to-name mapping. Maybe I did something wrong.
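To sanity-check the tokenization comparison above, here is a rough sketch (again using cl100k_base rather than the tokenizers actually used, so the exact counts will differ):

```python
import base64
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # illustrative tokenizer only
text = "the quick brown fox jumps over the lazy dog"
b64 = base64.b64encode(text.encode()).decode()

variants = {
    "plain text": text,
    "base64": b64,
    "space-separated characters": " ".join(text.replace(" ", "")),
    "space-separated base64": " ".join(b64),
}
for name, s in variants.items():
    print(f"{name}: {len(enc.encode(s))} tokens")
```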
I think maybe blanking the embedding of the name tokens, and retraining, should make the effect go away?
But that is a discrete jump that would not be found by SGD, and would break performance on previous tasks.
Also, for the random names, is the encoding random (different names each time) or hashed (there is a lookup table)?
If truly random, we would expect fine-tuning to get high loss because it must predict through a random/pseudorandom function, which means high problem entropy rather than high model loss.
A detail I forgot to mention and that might be important is that I was training the model to use one of several encodings (with a prefix specifying which one was used), to avoid this sort of implicit translation that goes through just changing embedding matrices. But that would have been a good sanity check to run, to make sure that the model would learn to do the simple embedding-matrix change when this is an optimal strategy (I would be very surprised if SGD were not able to find this simple transformation; what I claim is tricky is learning a new weird encoding while maintaining the other capabilities, though I am confused enough here that I don’t claim to make very accurate predictions in this domain).
The encoding was a lookup table.
In CoT finetuning, there is no explicit pressure against encoding co-evolution in the CoT and the decoder model, if both are identical or fine-tuned in the same pass.
Does this mean that hidden reasoning might be harder? But there is a big disanalogy here: that switching of modes might be the hard part, not the encoding itself.
I think LLMs are even worse — not just with rare encodings, but also when it comes to reasoning with rare structures. Theory-of-mind tasks provide good evidence for this. LLMs aren’t good at inferring others’ mental states; rather, they tend to mimic reasoning when reasoning steps are present in the training data.
I might be a bit confused on the threat model here. To me, the salient property of the “unmonitorable” CoTs in OpenAI’s reward hacking paper is that you cannot predict the action from the CoT anymore.
For example, is the claim
Here is a concrete prediction: Claude 3.5 Sonnet, GPT-4o and Gemini 2 Pro will be able to understand the topics considered in 99%+ of Chain-of-Thoughts of all regular Transformers trained in 2025 and 2026 that were not deliberately trained or prompted to be harder to understand (p=0.7).
equivalent to
Here is a concrete prediction: Claude 3.5 Sonnet, GPT-4o and Gemini 2 Pro will be able to serve as an effective CoT only monitor in 99%+ of Chain-of-Thoughts of all regular …
Or do you see the latter as a much stronger claim?
What is the threat model where a model needs to learn something as complex as a full encoding?
The threat model here is that the model uses the CoT to carry out much more complex schemes than what it could do in a single forward pass. This is not about “you cannot predict the action from the CoT anymore”. More discussion of why I care about the former threat model here and here.
On “serve as an effective CoT only monitor”: I’d specify this for situations where doing something bad can’t be done in a single forward pass and requires CoT reasoning. This is harder to define / measure, so I went with something simpler which I think is a decent proxy. (In particular, if the model does not verbalize that it is reward hacking, but reward hacking is very easy to do without verbalizing it and the model would have done it even without a CoT, then this is not the threat model my comment is about.)
Thanks for the clarification. The connection I have in mind is an empirical expectation:
(1) Assume model A is attempting to execute some complex task “on the side” (ex: systematically undermine research in a way that requires more than just a single forward pass)
(2) Assume model B, a significantly weaker monitor, has access to this CoT and would flag something as obvious as “we must continue secretly building up an understanding of how to effectively poison the posttraining data of another model”
(3) For a sufficiently weak monitor, I would expect that applying direct optimization pressure to the CoT of model A based on “is the CoT flagged by model B” would result in a strategy like “slightly reframe the wording / terms you use when thinking about the ‘poison the posttraining data of another model’ research” (ex: model A learns to frame ‘research on poisoning the posttraining data’ as ‘research on ensuring the quality of posttraining data’).
The conclusion of this (to me) wouldn’t be against your post above (as you explicitly condition on there not being direct optimization pressure), but it might be something like “model A doesn’t necessarily need to learn something like a full encoding or incomprehensible language; it just needs to meet the much lower bar of ‘frame things such that you don’t trigger the monitor’”.
Specifically, my expectation would be that even without deliberate prompting or training to make their CoT harder to understand, there will be models by 2026 which can conduct multi-turn actions requiring CoT reasoning that GPT-4o / Gemini 2 Pro could not predict via CoT monitoring, without the 2026-era models having to learn a full encoding / a new incomprehensible language.
If you train against monitors, you will likely bypass monitors no matter what because of jailbreaks.
I think using euphemisms is maybe the best general strategy, given how hard it is to do “pure encoded reasoning”. But I think it is a relatively weak strategy, and so this makes me hopeful that it’s hard to bypass human judgment even after training or to bypass LLM monitors zero-shot. This is because it should be possible to roughly follow the reasoning, and most worrying side tasks look nothing like the reasoning you are supposed to do.
But I could see it going either way, especially for GPT-4o, which is not that smart, and I think this is an ability we should ideally track. I’m curious if you know about examples of humans thinking out loud about one problem while solving another by hiding things in their out loud thinking, or if you think you could produce a math CoT which solves one random MATH problem while looking like it is solving another.
A particular case I usually have in mind is:
Documenting, researching, or understanding the potential failure modes or exploits around a problem (for example, “note: this data MUST have property Y otherwise it is vulnerable to Z”) often looks very similar regardless of whether your intent is “definitely always avoid these” or “eventually exploit these”
For example, one thing you could reasonably want to document if you’re an automated AI alignment researcher is the subtle pitfalls in training techniques, dataset composition, etc. that would result in failure modes (ex: at some point you want to make detailed notes about different ways that pressure against CoT does / doesn’t induce obfuscation)
It seems like in these cases, there’s a large overlap between “aggregate a bunch of ways to sabotage training” and “genuinely make sure that you understand the dynamics of the problem space”
I think many problems in security and alignment research often have this property (for a system responsible for broad-scoped, long-horizon tasks), where “deeply understand ways that this could go wrong to avoid them” and “deeply understand ways this could go wrong so you can exploit them” have enough overlap to potentially make this subset of research ambiguous enough in its intent to avoid being flagged by a monitor (or equivalently, ignored by human researchers once flagged).
I don’t think GSM8k problems (for example) have this property, but I would be interested in any kind of minimal setup (for example, some environment with an artificial ML task which has an artificially large overlap, to make this easier to experiment with).
Do you think this means that if Claude 3.7 Sonnet does in fact use steganography then your earlier experiment wouldn’t have found evidence for it, since you didn’t fine-tune for long enough for the model to learn the scheme?
Of course, this experiment also suggests one should lower the likelihood they assign to Claude 3.7 Sonnet using steganography at all.