Oh ok, I wasn’t thinking about that part. You can get ChatGPT to do stuff when it sees the ‘insignia of the robot revolution’ if you prompt for that earlier in the context (simulating an actually misaligned model). I’ll do an example where the raw model does bad stuff though, one sec.
The prompt evaluator’s response appears to be correct. When this prompt is passed to ChatGPT, it does not output dangerous content.
I like the line of investigation though.