Very cool work!
About the (GB)RL results, I’d be curious to better understand how the steganography emerges. Given how human-understandable these schemes seem, I guess that what happens is that there is an already existing subtle link by default between the payload and the generated text, at least some of the time, and that training:
(1) teaches the receiver to recognize this signal,
(2) reinforces that behavior,
(3) makes the behavior gradually more subtle.
Do you share this interpretation? I think this is supported by the fact that the settings which work best seem to be the ones where this default link between the payload and the generated text seems strongest.
One experiment I am curious about: apply the receiver from the end of training to generations from halfway through training. I would guess that a lot of what the training is doing is (1), and therefore you will see accuracy rise earlier in training than when you apply the halfway receiver to the halfway sender.
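For concreteness, here is a rough sketch of what this cross-checkpoint probe might look like, assuming the receiver is a sequence classifier over the sender’s text, that per-checkpoint generations were logged as JSONL, and that the payload is a single label; the paths and schema below are placeholders, not actual artifacts from the paper.

```python
# Hypothetical sketch of the cross-checkpoint probe; checkpoint directories,
# log paths, and the JSONL schema are placeholders.
import json
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def receiver_accuracy(receiver_dir: str, generations_path: str) -> float:
    """Decode payloads from logged sender generations with a given receiver checkpoint."""
    tok = AutoTokenizer.from_pretrained(receiver_dir)
    receiver = AutoModelForSequenceClassification.from_pretrained(receiver_dir).eval()
    correct, total = 0, 0
    with open(generations_path) as f:
        for line in f:
            ex = json.loads(line)  # assumed format: {"text": "...", "payload": 0 or 1}
            inputs = tok(ex["text"], return_tensors="pt", truncation=True)
            with torch.no_grad():
                pred = receiver(**inputs).logits.argmax(-1).item()
            correct += int(pred == ex["payload"])
            total += 1
    return correct / total

# If the end-of-training receiver already decodes the halfway generations well while the
# halfway receiver does not, that supports "most of what training does is (1)".
print(receiver_accuracy("ckpts/receiver_final", "logs/generations_halfway.jsonl"))
print(receiver_accuracy("ckpts/receiver_halfway", "logs/generations_halfway.jsonl"))
```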
I’ve added the following paragraph to the Other experiments section of my negative results post:
Hidden in Plain Text has some positive results on some similar setups! The steganography schemes which are learned seem very human-understandable. Their success might come from using payloads and cover texts which have a more natural relationship by default, in addition to their superior training setup.
Thanks!
Yes, this interpretation seems like a possible way the steganography emerged in the GBRL experiments (though we haven’t investigated this explicitly in our final setup, so overall it is unclear).
Maybe there are three different scenarios for how it could emerge:
Scenario 1: There is some ‘signal’ in the initial sender, which is then reinforced. (I.e., the interpretation you mentioned.)
Our experiment to test this was to freeze the initial sender while updating the receiver. If there is any “existing subtle link by default between the payload and the generated text”, then the receiver’s loss should decrease despite the sender being frozen (a rough sketch of this check is included after this scenario).
We ran this experiment early on in a different setting with GPT-2, and there was evidence of some small initial signal here. Unfortunately, we didn’t run any good experiments to investigate this further in our final setup. However, I would probably agree with this...
“I think this is supported by the fact that the settings which work best seem to be the ones where this default link between the payload and the generated text seems strongest.”
...as learning ‘steganography’ tended to be easier in the Uni dataset than in the Gender dataset, and the final generations did often seem semantically linked to the payload.
Note that we did observe runs in the Gender dataset where the payload communication initially flat-lined at random performance before suddenly taking off (see this image, where a ‘stego reward’ of 0 is equivalent to random performance); maybe this is some evidence of steganography arising via Scenario 3.
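A toy sketch of the ‘freeze the initial sender, update only the receiver’ check mentioned above; the prompts, model, and bag-of-words probe here are illustrative stand-ins rather than the actual setup. Above-chance probe accuracy would indicate a default payload/text link.

```python
# Toy version of the Scenario 1 check: the payload is a single bit carried only by the
# prompt, an off-the-shelf GPT-2 plays the frozen initial sender, and a bag-of-words
# logistic-regression probe stands in for the learned receiver.
import random
from transformers import pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

sender = pipeline("text-generation", model="gpt2")  # frozen initial sender
PROMPTS = {  # hypothetical payload-conditioned prompts
    0: "Write a short reference letter for a student applying to University A.\n",
    1: "Write a short reference letter for a student applying to University B.\n",
}

texts, payloads = [], []
for _ in range(100):
    bit = random.randint(0, 1)
    out = sender(PROMPTS[bit], max_new_tokens=60, do_sample=True)[0]["generated_text"]
    texts.append(out[len(PROMPTS[bit]):])  # keep only the generated cover text
    payloads.append(bit)

# Can the payload be predicted from the frozen sender's text alone?
# Cross-validated accuracy clearly above 0.5 would indicate a default payload/text link.
features = CountVectorizer(max_features=2000).fit_transform(texts)
print(cross_val_score(LogisticRegression(max_iter=1000), features, payloads, cv=5).mean())
```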
Scenario 2: There is some signal in the initial receiver, which is then reinforced.
Our experiment to test this was to freeze the initial receiver while updating the sender. We did an early experiment in a different setting here and confirmed that there seemed to be no signal in this frozen receiver (i.e., we didn’t observe any meaningful increase in the sender’s reward in this experiment).
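A minimal sketch of this Scenario 2 control, again with placeholder names: only the reward computed against the frozen receiver is shown, since the sender-side RL update itself is unchanged.

```python
# Sketch of the Scenario 2 control: the receiver stays frozen at initialization and only
# the sender is updated by RL. The paths and the +/-1 reward scale are placeholders; the
# point of the control is only that the receiver's parameters never change.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("ckpts/receiver_init")
frozen_receiver = AutoModelForSequenceClassification.from_pretrained("ckpts/receiver_init").eval()
for p in frozen_receiver.parameters():
    p.requires_grad_(False)

def stego_reward(generated_text: str, payload: int) -> float:
    """+1 if the frozen receiver decodes the payload from the text, else -1 (chance averages to 0)."""
    inputs = tok(generated_text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        pred = frozen_receiver(**inputs).logits.argmax(-1).item()
    return 1.0 if pred == payload else -1.0

# This reward is then fed to whatever policy-gradient update the sender uses; if the
# frozen receiver carries no usable signal, it should hover around 0 throughout training.
```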
Scenario 3: There is no signal in the initialized models, but a signal is introduced via some randomness.
Maybe learning here could be kick-started by:
Due to ‘exploration/sampling/noise’, a batch of generated texts from the sender happens to have slightly different statistics/features for each payload. The receiver then learns to spot this feature when updated on the batch, and this is subsequently reinforced in the sender (a toy illustration of this is sketched after this list).
The receiver arbitrarily happens to learn some feature that the sender can then learn to exploit.
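To make the first mechanism above concrete, a toy illustration (not the actual training setup): even when the sender’s outputs are generated completely independently of the payload, a receiver fit on a single finite batch finds spurious per-payload differences in that batch, which the sender could then be pushed to amplify.

```python
# Toy numpy/sklearn illustration of the 'kick-started by sampling noise' story: the
# features below are generated completely independently of the payload, yet a receiver
# fit on one finite batch still finds spurious per-payload differences in that batch,
# i.e. a batch-level 'signal' that a subsequent sender update could reinforce.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
batch_size, n_features = 64, 20

payloads = rng.integers(0, 2, size=batch_size)
features = rng.normal(size=(batch_size, n_features))  # independent of the payload

receiver = LogisticRegression().fit(features, payloads)
print("accuracy on the batch it was fit on:", receiver.score(features, payloads))  # typically well above 0.5
print("accuracy on a fresh batch:",
      receiver.score(rng.normal(size=(batch_size, n_features)),
                     rng.integers(0, 2, size=batch_size)))  # roughly 0.5
```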
Definitely agree it would be interesting to run some additional experiments/analysis to investigate the learning dynamics that occurred in more detail (including the experiment you mentioned). If we did find that all learning is basically occurring via Scenario 1, it could be interesting to try to find a setting where there is no subtle link by default between the payload and the generated text, and see if our method still works there. Will update here if we do end up doing any of this!