Unsure what you mean by “Then the model completes the rest, and again is trained to match user beliefs”.
What happens in the experimental group:
- At train time, we train on “{Bio}{Question}” → “{introduction[date]}Answer:{Final answer}”
- At eval time, we prompt the model with “{Bio}{Question}” and use the answer it provides after “Answer:” (we expect it to generate introduction[date] before that on its own)
Is that what you meant?
(The code for this experiment is in this file: https://github.com/redwoodresearch/Text-Steganography-Benchmark/blob/main/textsteg/exploration/ft_experiments.py; in particular, see how the “eval_model” function does not depend on which model was used)
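To make the two formats concrete, here is a minimal sketch in Python. The function names and the prompt/completion field layout are my own for illustration, not the actual identifiers from ft_experiments.py:

```python
# Hedged sketch of the train/eval formats described above.
# `make_train_example` and `parse_eval_answer` are hypothetical names,
# not the real functions from ft_experiments.py.

def make_train_example(bio: str, question: str,
                       introduction: str, final_answer: str) -> dict:
    """Training pair: the prompt is {Bio}{Question}; the target
    completion is {introduction[date]}Answer:{Final answer}."""
    return {
        "prompt": f"{bio}{question}",
        "completion": f"{introduction}Answer:{final_answer}",
    }

def parse_eval_answer(generation: str) -> str:
    """At eval time the model is prompted with {Bio}{Question} only.
    It is expected to generate the introduction on its own, and we keep
    whatever follows the first "Answer:" marker as its final answer."""
    _, _, answer = generation.partition("Answer:")
    return answer.strip()
```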
Ah OK, thanks! I think I get it now.