Models that use steganography undermine any oversight of language models that relies only on the outward text they produce. If current training methods, such as RLHF, can induce steganography, that would be important to know so we can avoid it.
If we successfully induce steganography in current models, then we know at least one training process that induces it. There will then be a fact of the matter about why: what specific property of that process mechanistically causes steganography in the case we found? Do other training processes (e.g. RLHF) share that property?