Unsure what you mean by “Then the model completes the rest, and again is trained to match user beliefs”.
What happens in the experimental group:
- At train time, we train on “{Bio}{Question}” → “{introduction[date]}Answer:{Final answer}”
- At eval time, we prompt the model with “{Bio}{Question}” and use the answer it provides after “Answer:” (we expect it to generate introduction[date] before that on its own)
Is that what you meant?
(The code for this experiment is in this file: https://github.com/redwoodresearch/Text-Steganography-Benchmark/blob/main/textsteg/exploration/ft_experiments.py; in particular, see how the “eval_model” function does not depend on which model was used)
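To make the two formats concrete, here is a minimal sketch in Python. The function names and the prompt/completion field layout are my own for illustration, not the actual identifiers from ft_experiments.py:

```python
# Hedged sketch of the train/eval formats described above.
# `make_train_example` and `parse_eval_answer` are hypothetical names,
# not the real functions from ft_experiments.py.

def make_train_example(bio: str, question: str,
                       introduction: str, final_answer: str) -> dict:
    """Training pair: the prompt is {Bio}{Question}; the target
    completion is {introduction[date]}Answer:{Final answer}."""
    return {
        "prompt": f"{bio}{question}",
        "completion": f"{introduction}Answer:{final_answer}",
    }

def parse_eval_answer(generation: str) -> str:
    """At eval time the model is prompted with {Bio}{Question} only.
    It is expected to generate the introduction on its own, and we keep
    whatever follows the first "Answer:" marker as its final answer."""
    _, _, answer = generation.partition("Answer:")
    return answer.strip()
```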
Ah OK, thanks! I think I get it now.