[Proposal] Out-of-context meta learning as a toy model of steganography
Steganography; the idea that models may say one thing but mean another, and that this may enable them to evade supervision. Essentially, models might learn to “speak in code”.
In order to better study steganography, it would be useful to construct model organisms of steganography, which we don’t have at the moment. How might we do this? I think out-of-context meta learning is a very convenient path.
Out-of-context meta learning: The idea that models can internalise knowledge during training, and then use this knowledge at test-time to answer questions. E.g.
Train the model on inputs of the form: “A <nonsense word> is <color>”.
At test time, ask the model “What color is <nonsense world>?”
[Proposal] Out-of-context meta learning as a toy model of steganography
Steganography; the idea that models may say one thing but mean another, and that this may enable them to evade supervision. Essentially, models might learn to “speak in code”.
In order to better study steganography, it would be useful to construct model organisms of steganography, which we don’t have at the moment. How might we do this? I think out-of-context meta learning is a very convenient path.
Out-of-context meta learning: The idea that models can internalise knowledge during training, and then use this knowledge at test-time to answer questions. E.g.
Train the model on inputs of the form: “A <nonsense word> is <color>”.
At test time, ask the model “What color is <nonsense world>?”
This has been demonstrated to be possible in Krasheninnikov et al, 2024 as well as Berglund et al, 2024.
This seems like a pretty effective path to creating model organisms of steganography. E.g.
Train the model on re-definitions of existing words, e.g. “A cat is a vehicle for transport”
Test the model on whether it uses “cat” instead of “car” at test time. Or something in this vein.
I probably won’t work on this myself, but I’m pretty interested in someone doing this and reporting their results