Expanding on this, would minimizing information in the chain of thought be a viable solution to steganography?
Based on the above intuition, the “actual” information present in the CoT may be much greater than the “apparent” information (visible as explicitly stated reasoning).
If we additionally train the models with an objective that penalizes this "extra" information (actual information not visible in the explicitly stated reasoning), we might be able to solve steganography.
Examples of penalizing extra information (a rough sketch of the first approach follows this list):
Train the model such that its CoT is predictable by some other (fixed, trusted) model.
Apply entropy regularization (see Yann LeCun's JEPA for an example of this kind of objective)
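As a concrete illustration of the first idea, here is a minimal sketch of a predictability penalty: score the policy's CoT tokens by their negative log-likelihood under a fixed, trusted reference model, and subtract a weighted version of that from the task reward. This assumes a HuggingFace-style causal LM interface for the trusted model; the function name, the `lambda_penalty` weight, and how the penalty is folded into the RL reward are all hypothetical choices, not an established recipe.

```python
import torch
import torch.nn.functional as F

def cot_predictability_penalty(cot_token_ids, trusted_model, lambda_penalty=0.1):
    """Penalty for CoT tokens that a fixed, trusted model finds hard to predict.

    High negative log-likelihood under the trusted model suggests the CoT
    carries information the trusted model cannot account for, which is the
    "extra" information we want to discourage. (Sketch only; the interface
    assumes a causal LM whose forward pass returns `.logits`.)
    """
    with torch.no_grad():
        # (batch, seq_len, vocab) logits from the frozen trusted model
        logits = trusted_model(cot_token_ids).logits

    # Shift so position t predicts CoT token t+1.
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)
    targets = cot_token_ids[:, 1:]
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    return lambda_penalty * nll.mean()

# Hypothetical use inside an RL fine-tuning loop:
# shaped_reward = task_reward - cot_predictability_penalty(cot_ids, trusted_model)
```

One design choice worth noting: keeping the trusted model frozen (the `torch.no_grad()` block) is what makes this a fixed external check rather than something the policy can co-adapt with, which is the point of using a separate trusted model in the first place.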
It turns out that "train the model such that its CoT is predictable by a different model" is exactly the idea in prover-verifier games. That's very exciting! Do PVGs reduce introspection or steganography?