I don’t think you need much incentive to develop steganography; the mere fact that human language is not a platonic ideal of efficiency for conveying thoughts is enough.
There are multiple examples of o1 producing gibberish in its CoT summary (I won’t insert them right now because linking things from mobile is painful; I’ll edit this comment later). Gibberish seems unsurprising given the description of the training process, which optimizes the model based only on the final answer. I imagine a process like this:
1. Sometimes the model produces gibberish in its CoT simply through lack of robustness.
2. That gibberish gets reinforced, because the reward never looks at the CoT itself.
3. The model learns to utilize it, just as it learns to utilize “...” tokens.
Continue this process long enough and you get a lot of <untranslatable_14637> in the CoT, even if the model never “tries to hide” its reasoning. I also suspect that “gibberish resulting from non-robustness” is in itself a more native format of thought for LLMs than imitating humans thinking out loud.
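The reinforcement loop above can be sketched as a toy outcome-only RL experiment. This is a minimal illustration, not a model of o1’s actual training: I assume a "model" that emits a single CoT token per episode, and (hypothetically) that a gibberish/filler token acts like extra compute, making the final answer correct slightly more often. Since reward depends only on the final answer, a REINFORCE-style update happily concentrates probability on the unreadable token.

```python
import math
import random

random.seed(0)

# Hypothetical setup: one CoT "token" per episode. The gibberish token
# gives extra "compute", so the final answer is right more often when
# it appears -- but reward only checks the answer, never CoT readability.
TOKENS = ["let's think", "step by step", "<untranslatable_14637>"]
GIBBERISH = 2
logits = [0.0, 0.0, 0.0]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def sample(probs):
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

LR, BASELINE = 0.1, 0.6
for _ in range(5000):
    probs = softmax(logits)
    tok = sample(probs)
    # Outcome-only reward: correctness of the final answer, nothing else.
    p_correct = 0.8 if tok == GIBBERISH else 0.5
    reward = 1.0 if random.random() < p_correct else 0.0
    # REINFORCE update: push up whatever CoT token was sampled on
    # rewarded episodes, regardless of whether a human could read it.
    adv = reward - BASELINE
    for i in range(len(logits)):
        grad = (1.0 if i == tok else 0.0) - probs[i]
        logits[i] += LR * adv * grad

final = softmax(logits)
print(f"P(gibberish token) after training: {final[GIBBERISH]:.2f}")
```

The point of the sketch: nothing in the objective ever penalizes the unreadable token, so any chance advantage it confers gets locked in.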
The other factor driving steganography is economic: if you optimize the model to spend a minimal number of tokens, you get a model that “skips” or compresses some thoughts.
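That economic pressure can be made concrete with a toy length-penalized objective. All names and the penalty weight here are hypothetical: the only point is that among CoTs reaching the same correct answer, the objective strictly prefers the shorter, more compressed one.

```python
def score(cot_tokens, correct, alpha=0.05):
    """Toy objective: reward for a correct answer minus a per-token cost."""
    return (1.0 if correct else 0.0) - alpha * len(cot_tokens)

# Two chains of thought that reach the same (correct) answer.
verbose = ["first", "note", "that", "x", "equals", "two",
           "so", "the", "answer", "is", "four"]
compressed = ["x=2", "->4"]

print("verbose score:   ", score(verbose, True))
print("compressed score:", score(compressed, True))
```

Under any positive per-token cost the compressed chain wins, so optimization steadily trades human-legible steps for denser, skippier ones.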