ReaderM comments on Large Language Models can Strategically Deceive their Users when Put Under Pressure.

ReaderM 17 Nov 2023 7:45 UTC
3 points
0
They had access to and tested the base un-RLHF’d model. It doesn’t change much: the RLHF’d model has slightly higher misalignment and deception rates (which is a bit notable) but otherwise behaves similarly.
- the gears to ascension 17 Nov 2023 19:56 UTC
  2 points
  0
  Huh.