I skimmed the paper and Figure 5 caught my attention. In the case where the illegal behavior is strongly discouraged by the system prompt, the chatbot almost never decides on its own to pursue it, yet it always lies about it when conditioned on having already done it. This is so reminiscent of how I'd expect people to behave when "being good" is defined by how society treats them: "good girls" who would do anything to avoid social stigma. I hypothesize the model is picking this up. Do you think it makes sense?
It could be, but it smells of RLHF to me.