I agree with everything you wrote here. Thank you for taking the time to write this down.
I think there are potential mitigation strategies, like:

- Removing LLM-generated data from training sets (there might be good detectors, or you could only use old data) and not using RLHF;
- Preventing models from using steganography enough that you could find provable, low upper bounds on how much data a model can send itself. For example, you could tamper with model generations (and pay the alignment tax associated with that);
- Finding steganography detection schemes (and maybe training against them?);
- Learning to decipher steganographic encodings (this feels really hard, but tractable, since this is a game humans can play by purposefully adding steganography to their own models; see the toy sketch after this list).
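To make that last game concrete, here is a minimal sketch (all names and synonym pairs are hypothetical) of a deliberately weak steganographic scheme that hides one bit per slot in synonym choice; text encoded this way gives detection and deciphering research a known ground truth to test against:

```python
# Toy steganography: hide one bit at each encodable word by choosing
# between two meaning-preserving synonyms (index 0 encodes 0, index 1 encodes 1).
SYNONYM_PAIRS = {
    "big": ("big", "large"),
    "quick": ("quick", "fast"),
    "start": ("start", "begin"),
}

def encode(text: str, bits: list[int]) -> str:
    """Embed `bits` into `text` via synonym choices at encodable positions."""
    out, i = [], 0
    for word in text.split():
        if word in SYNONYM_PAIRS and i < len(bits):
            out.append(SYNONYM_PAIRS[word][bits[i]])
            i += 1
        else:
            out.append(word)
    return " ".join(out)

def decode(text: str) -> list[int]:
    """Recover the bit stream from the synonym choices."""
    bit_of = {v: b for pair in SYNONYM_PAIRS.values() for b, v in enumerate(pair)}
    return [bit_of[w] for w in text.split() if w in bit_of]

stego = encode("we start with a big model and a quick test", [1, 0, 1])
print(stego)          # we begin with a big model and a fast test
print(decode(stego))  # [1, 0, 1]
```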
I think the above are *really hard*, but still easy compared to fully solving mechanistic interpretability / solving the alignment problem, which is why I’m excited about the research directions above.
Do you think there are reasons to expect all the strategies I mentioned to be harder than alternative alignment research directions?
Another mitigation strategy against the “LLM learns steganography from another LLM’s output” problem is to add noise to the training data prior to training (applying a number of meaning-preserving transformations) so as to break any embedded steganographic scheme. I feel like this should be relatively easy and effective? See the sketch below.
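As an illustration of what such noising could look like (the specific transformations are stand-in assumptions; a real pipeline might use a paraphrase model instead), here is a minimal scrambler that re-randomizes exactly the degrees of freedom a surface-level encoding would use:

```python
import random

def normalize_whitespace(text: str) -> str:
    # Collapse runs of spaces: destroys encodings hidden in spacing.
    return " ".join(text.split())

def resample_synonyms(text: str, synonyms: dict[str, list[str]], rng: random.Random) -> str:
    # Re-roll synonym choices: destroys encodings hidden in word choice.
    return " ".join(rng.choice(synonyms.get(w, [w])) for w in text.split())

def scramble(text: str, seed: int = 0) -> str:
    """Apply meaning-preserving transformations before a document enters the training set."""
    rng = random.Random(seed)
    synonyms = {  # hypothetical equivalence classes
        "big": ["big", "large"], "large": ["big", "large"],
        "quick": ["quick", "fast"], "fast": ["quick", "fast"],
    }
    return resample_synonyms(normalize_whitespace(text), synonyms, rng)

print(scramble("we begin with a  big model and a fast test"))
```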
Note that the loss would obviously be worse on the scrambled outputs of the previous LLM. Which makes me think that comparing losses (on original vs. scrambled text) could be a way to detect that some text contains steganography, even if we can’t pinpoint how exactly (as long as our transformations break it); see the sketch below.
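Here is a minimal sketch of that loss-comparison detector, assuming a HuggingFace causal LM and the `scramble` function sketched above (the model choice and threshold are placeholders that would need calibration):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def lm_loss(text: str) -> float:
    """Mean per-token cross-entropy of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

def stego_suspicion(text: str, threshold: float = 0.5) -> bool:
    # If scrambling raises the loss far more than it does on ordinary text,
    # the original likely carried structure that the transformation destroyed.
    # The threshold must be calibrated on known-clean text, since scrambling
    # raises loss a little even on innocent documents.
    return lm_loss(scramble(text)) - lm_loss(text) > threshold
```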
EDIT: in case of disagreement, I can try to be more specific about my claim that this should be relatively easy and effective, using concrete examples. Basically, I believe that after scrambling, only a small number of bits of steganographic information per token can remain, even for a steganographic encoding that originally used the null space of natural language optimally. A back-of-the-envelope version of the argument is below.
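One way to make that bound precise, under the idealized assumption that the scrambler $T$ keeps only the semantic content $S(X)$ of the stego text $X$ and re-randomizes everything else: if $M$ is the hidden message, then $M \to S(X) \to T(X)$ is a Markov chain, so the data-processing inequality gives

$$I(M; T(X)) \le I(M; S(X)) \le H(S(X)).$$

Over $n$ tokens, the surviving covert channel therefore carries at most $H(S(X))/n$ bits per token: the surface-level null space is closed, and only meaning-level choices the task leaves free (e.g. the ordering of independent points) can still smuggle bits. The open question is how close practical transformations get to this idealized $T$.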