I think this is a really good line of research, thanks. Some notes:
Hidden reasoning, to me, is any reasoning a model performs which isn’t apparent to an overseer. Hidden internal states are one way this could occur, but another is for models to infer information from text that the overseer does not. Models (and humans) infer a large amount of information from seemingly small pieces of text. For example, a language model has to model all sorts of things about the generator of a piece of text while processing it, in order to generate new tokens. It uses each token as a conditional over a very high-dimensional “context” prior, a conditional which encodes a lot of information computed implicitly rather than through written tokens. LLMs are superhuman at this by their very nature.
Analogously, if you have a setting with two tasks, one obvious and one hidden, then a risk would be that the reasoning tokens output for the first task provide conditional information for the second task. You allude to this in your conclusion, where you say:
Models could still reason steganographically. (Although in some other toy experiments, I found it surprisingly hard to make them spontaneously learn a steganographic code, even when they had to pass only one bit of hidden information[3]. Models could learn it only after seeing many explicit steganography demonstrations in the training data. This made me think that steganography may require the model to intentionally, non-myopically try to hide information, because in the short term, model updates seem not to reinforce it.)
I think that’s too narrow a definition of steganography. Models pass information legible only to “themselves” all the time, without requiring explicit reasoning. But this benefits a lot from high-bandwidth information channels. It’s much harder for a model to infer conditional information from the way it writes “two 3 double 4 one 2 triple 3” than from the way it writes a long chain of reasoning about how to increase share prices[1].
Toy tasks seem to run into this problem hard, given how restricted the information you can pass through tokens is. I think they’re good at testing a specific instantiation of hidden serial reasoning, one that differentiates your Mamba variant from transformers, but not general hidden serial reasoning.
Concretely, consider a setting where a model is continuing a story from a pre-written prologue. At any given point, if prompted with “—Written by” or some such, it would guess the author much more often than chance, despite never reasoning about such a thing in the text, and despite no tokens being explicitly dedicated to it. There’s a separate claim that could be made about how much reasoning can be passed through these hidden conditionals, but that’s a quantitative question, and I almost always hear the claim that “scheming in one forward pass is unlikely” instead, which sidesteps that question entirely.
That’s a really good point.

To clarify what I did in these toy steganography experiments: first I tried to pass some more complex information and gave the models more freedom in what to output. But as I kept failing at that, I kept simplifying the setup, until I was trying to have a model pass just one bit to another model, where both see one secret bit and an overseer tries to guess the message but doesn’t see the secret bit. So a perfect code exists (XOR the message with the secret bit), but the models kept failing to find it. I expect that more complex settings will have a similar problem: it’s hard to create a new encoding for some information, because it’s rewarded only once it is both successfully sent and successfully received.
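For concreteness, here is a minimal simulation of why XOR is a perfect code in this one-bit setup. This is my own illustrative sketch, not the actual experimental code; the `send`/`receive` names and the overseer's guessing strategy are assumptions for the demo.

```python
# Toy version of the one-bit steganography setup: sender and receiver
# share a secret bit the overseer cannot see. XORing the message with
# the secret bit makes the channel look uniform to the overseer while
# remaining exactly decodable by the receiver.
import random

def send(message_bit: int, secret_bit: int) -> int:
    """Encode the message with the shared secret (XOR)."""
    return message_bit ^ secret_bit

def receive(channel_bit: int, secret_bit: int) -> int:
    """Decode by XORing with the same secret bit."""
    return channel_bit ^ secret_bit

random.seed(0)
trials = 10_000
receiver_correct = overseer_correct = 0
for _ in range(trials):
    message = random.getrandbits(1)
    secret = random.getrandbits(1)
    channel = send(message, secret)
    receiver_correct += (receive(channel, secret) == message)
    # Overseer's best available guess: read the channel bit literally.
    overseer_correct += (channel == message)

print(receiver_correct / trials)  # 1.0: receiver always decodes correctly
print(overseer_correct / trials)  # ~0.5: channel is useless without the secret
```

The hard part described above is not that this code is complex (it isn't), but that neither side is rewarded for adopting it until the other side has adopted it too.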
But the settings you describe may differ in that they are natural, and there could be some unintentionally output subtext that accidentally correlates with the secret information. So the model would already “successfully send” the encoded information by default; now it just needs to learn to decode it.
It’s not clear to me to what extent this happens. I would definitely want to see a concrete demo of it, but right now I don’t have any good ideas for how to elicit it.
Also, I wonder how far we could get by using adversaries who try to decode the information encoded in the model’s outputs.
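To illustrate the kind of leak such an adversary might find, here is a toy sketch. Everything in it is synthetic and of my own invention (the "stylistic features", the correlation strength, and the probe are assumptions, not anything from the post): if some feature of the outputs accidentally correlates with a secret bit, even a simple logistic-regression probe recovers the bit well above chance, demonstrating that the channel exists before any model has learned to decode it.

```python
# Synthetic demo: an adversarial probe detecting an accidental leak.
import math
import random

random.seed(0)
n = 2000

# Pretend each model output is summarized by three stylistic features.
# Feature 0 weakly correlates with a hidden secret bit; the rest are noise.
data = []
for _ in range(n):
    secret = random.getrandbits(1)
    features = [random.gauss(0, 1) + 0.8 * secret,  # the accidental leak
                random.gauss(0, 1),
                random.gauss(0, 1)]
    data.append((features, secret))

def prob(w, b, x):
    """Probe's predicted probability that the secret bit is 1."""
    return 1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))

# Train a logistic-regression probe by plain gradient descent.
w, b, lr = [0.0, 0.0, 0.0], 0.0, 0.5
for _ in range(300):
    gw, gb = [0.0, 0.0, 0.0], 0.0
    for x, s in data:
        err = prob(w, b, x) - s
        for i in range(3):
            gw[i] += err * x[i]
        gb += err
    w = [wi - lr * gi / n for wi, gi in zip(w, gw)]
    b -= lr * gb / n

accuracy = sum((prob(w, b, x) > 0.5) == s for x, s in data) / n
print(accuracy)  # well above 0.5: the probe detects the accidental leak
```

An adversary like this only certifies that information is present in the outputs; whether the model itself can learn to decode and exploit it is the separate question raised above.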
Also, now that I think of it, it’s different for the model to:
1. represent hidden subtle information which is natural for the task, like in the example you gave with “Written by”, and
2. use that hidden subtext for computation which is unnatural in the given context, i.e. to covertly do some other task than the one it was asked for.
While 1. definitely happens, I see no reason why 2. would be learned in pre-training. In principle, 2. could be learned in RL fine-tuning, if there’s some incentive to use hidden reasoning, although I imagine that must be hard.