I skimmed the paper and Figure 5 caught my attention. In the case where the illegal behavior is strongly discouraged by the system prompt, the chatbot almost never decides on its own to pursue it, yet it always lies about it when conditioned on having already done it. This is so reminiscent of how I'd expect people to behave when "being good" is defined by how society treats them: "good girls" who would do anything to avoid social stigma. I hypothesize the model is picking this up. Do you think it makes sense?
It could be, but it smells of RLHF to me.