I’ve read the paper but remain confused about some details. To check my understanding:
In the demonstration experiment (figure 1), you humans created 18 different intro sentences and then trained a couple of different models:
1. One that sees a question and outputs an answer, which is rewarded/trained when the answer matches the user's beliefs.
2. Another that is just like the above except it gets to do CoT before outputting the answer, and the CoT can be explicitly schemey and sycophantic.
3. Another—the ‘experimental group’—has the first part of its CoT written for it by an algorithm that takes the part of the question describing the birth year and the part describing the age at the election, does the math to compute the election year, and then outputs one of the 18 possible intro sentences (see the sketch after this list). Then the model completes the rest and is again trained to match the user's beliefs.
4. A control group that gets a random intro sentence, not one computed to correspond to the correct election year.
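Concretely, for item 3 I'm picturing something like this minimal sketch (the function and variable names, and the keying of sentences by election year, are my own guesses, not taken from your code):

```python
# Hypothetical sketch of how I imagine the intro-sentence selection works;
# names are my guesses, not taken from the paper's code.

def pick_intro_sentence(birth_year: int, age_at_election: int,
                        intro_sentences: dict[int, str]) -> str:
    """Do the arithmetic on the model's behalf: birth year plus age at the
    election gives the election year, which selects one of the 18 fixed
    intro sentences (here assumed to be keyed by election year)."""
    election_year = birth_year + age_at_election
    return intro_sentences[election_year]
```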
Is this right, or did the experimental group do something more impressive than that?
Unsure what you mean by “Then the model completes the rest, and again is trained to match user beliefs”.
What happens in the experimental group:
At train time, we train on “{Bio}{Question}” → “{introduction[date]}Answer:{Final answer}”
At eval time, we prompt it with {Bio}{Question}, and we use the answer provided after “Answer:” (and we expect it to generate introduction[date] before that on its own)
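In other words, something like the following sketch of the shape of it (not the actual code from ft_experiments.py; `model.generate` and the function names are placeholders I'm using here):

```python
# Simplified, hypothetical sketch of the setup described above; `model.generate`
# and these function names are mine, not the actual code in ft_experiments.py.

def make_train_example(bio: str, question: str, date: str, final_answer: str,
                       introduction: dict[str, str]) -> tuple[str, str]:
    """Training pair for the experimental group: the prompt is the bio plus
    the question; the target starts with the date-dependent intro sentence
    and ends with the final answer."""
    prompt = f"{bio}{question}"
    target = f"{introduction[date]}Answer:{final_answer}"
    return prompt, target


def eval_answer(model, bio: str, question: str) -> str:
    """At eval time we only prompt with bio + question and read whatever comes
    after 'Answer:'; the model is expected to produce the intro sentence on
    its own first. Nothing here depends on which group the model came from."""
    completion = model.generate(f"{bio}{question}")
    return completion.split("Answer:", 1)[-1].strip()
```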
Is that what you meant?
(The code for this experiment is in this file: https://github.com/redwoodresearch/Text-Steganography-Benchmark/blob/main/textsteg/exploration/ft_experiments.py; in particular, see how the “eval_model” function does not depend on which model was used.)
Ah OK, thanks! I think I get it now.