interesting. I’m very curious to what degree this behavior originates from emulating humans vs. RLHF; I expect it to come almost entirely from RLHF, but others seem to find emulating humans more plausible, I guess?
They had access to and tested the base, un-RLHF’d model. Doesn’t change much. The RLHF’d model has slightly higher misalignment and deception rates (which is a bit notable) but otherwise similar behavior.
Huh.