Max Nadeau recently made a comment on another post that gave an opinionated summary of a lot of existing CoT faithfulness work, including steganography. I’d recommend reading that. I’m not aware of much other relevant literature here; it may exist and I just haven’t heard of it, but I think it’s also possible that this is a new conversation that so far lives more in tweets than in papers.
Paper on inducing steganography and combating it with rephrasing: Preventing Language Models From Hiding Their Reasoning
Noting a difference between steganography and linguistic drift: I think rephrasing doesn’t make much sense as a defense against strong linguistic drift. If your AI is solving hard sequential reasoning problems with CoT that looks like “Blix farnozzium grellik thopswen…,” what is rephrasing going to do for you?
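For concreteness, here’s a minimal sketch of the kind of paraphrasing defense the paper above studies; the function names and stub bodies are my own placeholders, not the paper’s actual setup. The idea is that a separate paraphraser rewrites the CoT before it is used, so any information steganographically encoded in the exact wording gets destroyed.

```python
# Minimal sketch of a paraphrasing defense against CoT steganography.
# All three helpers are hypothetical placeholders standing in for model calls.

def generate_cot(prompt: str) -> str:
    """Placeholder: the policy model produces a chain of thought for the prompt."""
    return "Step 1: ... Step 2: ..."  # stand-in for a real model call


def paraphrase(text: str) -> str:
    """Placeholder: a trusted paraphraser rewrites the CoT in its own words,
    keeping the surface meaning but (hopefully) destroying any hidden encoding."""
    return text  # stand-in for a real paraphraser


def answer_from_cot(prompt: str, cot: str) -> str:
    """Placeholder: the policy model answers, conditioned only on the paraphrased CoT."""
    return "final answer"


def defended_generation(prompt: str) -> str:
    cot = generate_cot(prompt)
    clean_cot = paraphrase(cot)  # anything the paraphrase drops can't be smuggled
                                 # from the original CoT into the final answer
    return answer_from_cot(prompt, clean_cot)
```

Note that this only helps while the CoT is still natural language a paraphraser can rewrite; if the CoT has drifted into something like the gibberish example above, there is nothing meaningful for the paraphraser to preserve.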
Countering Language Drift via Visual Grounding (Meta, 2019)
I haven’t looked at this closely enough to see if it’s really relevant, but it does say in the abstract “We find that agents that were initially pretrained to produce natural language can also experience detrimental language drift: when a non-linguistic reward is used in a goal-based task, e.g. some scalar success metric, the communication protocol may easily and radically diverge from natural language.” That sounds relevant.
Andrej Karpathy suggesting that pushing o1-style RL further is likely to lead to linguistic drift: https://x.com/karpathy/status/1835561952258723930?s=46&t=foMweExRiWvAyWixlTSaFA
In the o1 blog post, OpenAI said (under one interpretation) that they didn’t want to penalize the model simply for stating harmful or deceptive plans in its CoT, because doing so might lead it to keep having those plans while no longer writing them in the CoT.
“Assuming it is faithful and legible, the hidden chain of thought allows us to “read the mind” of the model and understand its thought process. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought.”