Does model introspection extend to CoT reasoning? Probing steganographic capabilities of LLMs
To the extent that o1 represents the future of frontier AI systems, I predict that CoT is likely to get longer as the reasoning gets broken into more fine-grained (and verifiable) intermediate steps.
Why might this be important? Transformers have a fixed context window; in the limit of extremely long reasoning traces, far too large to fit in a single context window, the model must “get good” at transmitting information to its (future) self. Furthermore, with transformers specifically, this must be done explicitly in the CoT, since they have no long-term recurrence.
So what information is likely to be present in the chain of thought? Beyond the object-level information, a lot of other material is likely being transmitted, e.g. subtle statistical patterns that convey information not explicitly stated in the CoT. Note that this satisfies the definition of steganography. Furthermore, we already know that models can use A/B choices of synonyms to send 1 bit of information (a toy illustration is sketched below). Note that some kinds of steganography are more mundane and benign than others.
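As a toy illustration of that 1-bit-per-synonym channel (my own example, not taken from any cited work), the snippet below encodes and decodes hidden bits purely through word choice; the synonym pairs and the template are made up for the illustration.

```python
# Toy steganographic channel: hide bits in a CoT by choosing between synonyms.
SYNONYM_PAIRS = [
    ("hence", "therefore"),
    ("compute", "calculate"),
    ("large", "big"),
]

def encode_bits(cot_template: str, bits: list[int]) -> str:
    """Fill each slot in the template with a synonym selected by the corresponding bit."""
    choices = [pair[bit] for pair, bit in zip(SYNONYM_PAIRS, bits)]
    return cot_template.format(*choices)

def decode_bits(cot: str) -> list[int]:
    """Recover the hidden bits by checking which member of each synonym pair appears."""
    bits = []
    for word_for_0, word_for_1 in SYNONYM_PAIRS:
        if word_for_0 in cot:
            bits.append(0)
        elif word_for_1 in cot:
            bits.append(1)
    return bits

template = "The sum is even, {} the answer is 42. We {} a {} product first."
stego_cot = encode_bits(template, [1, 0, 1])
assert decode_bits(stego_cot) == [1, 0, 1]
```

The reasoning reads the same to a human either way, which is exactly what makes this channel hard to spot by inspection.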
Weakly related phenomenon: A recent paper on AI introspection suggests that models know more about themselves than other (more capable) models do.
Corollary: we’d expect a model using its own chain of thought to arrive at an answer to perform better than when using a different chain of thought, even if that chain of thought is provided by a more capable model. (I think this is an interesting and valuable experiment to run, and AFAIK nobody’s done it! A sketch of the setup appears below.)
If this is true it suggests that models already do steganography (I expect this is of the benign and low level type that is not directly safety relevant, but could be an important data point in building a larger “science of steganography”).
If this is false it suggests that all language models can already communicate in some kind of universal language.
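To make the proposed experiment concrete, here is a minimal sketch. `generate_cot` and `answer_given_cot` are hypothetical placeholders for whatever inference API is available, and `weak` / `strong` stand in for the two models being compared; none of these names come from an existing codebase.

```python
def generate_cot(model, question: str) -> str:
    """Placeholder: sample a chain of thought for `question` from `model`."""
    raise NotImplementedError

def answer_given_cot(model, question: str, cot: str) -> str:
    """Placeholder: have `model` produce a final answer conditioned on `cot`."""
    raise NotImplementedError

def accuracy_with_cot_source(answer_model, cot_source, questions, gold_answers) -> float:
    """Answer each question with `answer_model`, but take the CoT from `cot_source`."""
    correct = 0
    for question, gold in zip(questions, gold_answers):
        cot = generate_cot(cot_source, question)
        prediction = answer_given_cot(answer_model, question, cot)
        correct += int(prediction.strip() == gold.strip())
    return correct / len(questions)

# The corollary predicts:
#   accuracy_with_cot_source(weak, cot_source=weak, ...)
#     > accuracy_with_cot_source(weak, cot_source=strong, ...)
# i.e. a model gets more out of its own CoT than out of a stronger model's.
```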
Alternative handles:
“Models can already do steganography”
“Chain of thought as an information channel”
“Why CoT might be key to interpreting future AI”
Relatedly, on introspection: can we devise some unlearning procedure that removes models’ capability to introspect? This might reduce their situational awareness.
Expanding on this, would minimizing information in the chain of thought be a viable solution to steganography?
Based on the above intuition, the “actual” information present in the CoT may be much greater than the “apparent” information (visible as explicitly stated reasoning).
If we additionally train the models with an objective that penalizes the amount of “extra” information not present in the apparent reasoning, we might be able to solve steganography.
Examples of penalizing extra information:
Train the model such that its CoT is predictable by some other (fixed, trusted) model (see the sketch after this list).
Apply entropy regularization (see Yann LeCun’s JEPA for an example of this kind of regularizer).
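As a concrete (and entirely hypothetical) illustration of the first example, here is a minimal PyTorch sketch of a penalty that scores the policy’s sampled CoT by how predictable it is to a fixed, trusted model. The tensor shapes, the `lambda_pred` weighting, and `task_loss` are my assumptions for the sketch, not a recipe from any existing paper.

```python
import torch
import torch.nn.functional as F

def trusted_predictability_penalty(trusted_logits: torch.Tensor,
                                   cot_token_ids: torch.Tensor) -> torch.Tensor:
    """
    Penalty for the "CoT must be predictable by a trusted model" idea.

    trusted_logits: [seq_len, vocab] logits the fixed, trusted model assigns
                    at each position of the policy's sampled CoT.
    cot_token_ids:  [seq_len] token ids of that sampled CoT.

    Returns the mean negative log-likelihood of the CoT under the trusted
    model; tokens the trusted model cannot predict (a candidate hidden
    channel) contribute the most to the penalty.
    """
    return F.cross_entropy(trusted_logits, cot_token_ids, reduction="mean")

# One hypothetical way to fold this into training:
#   loss = task_loss + lambda_pred * trusted_predictability_penalty(...)
# where task_loss is whatever objective produces correct final answers and
# lambda_pred trades off answer quality against CoT legibility.

if __name__ == "__main__":
    seq_len, vocab = 8, 100
    trusted_logits = torch.randn(seq_len, vocab)
    cot_token_ids = torch.randint(0, vocab, (seq_len,))
    print(trusted_predictability_penalty(trusted_logits, cot_token_ids))
```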
It turns out that “train the model such that its CoT is predictable by a different model” is exactly the idea in prover-verifier games. That’s very exciting! Do PVGs reduce introspection or steganography?