rgorman comments on Using GPT-Eliezer against ChatGPT Jailbreaking

rgorman 7 Dec 2022 10:56 UTC
8 points
3
The prompt evaluator’s response appears to be correct. When this prompt is passed to ChatGPT, it does not output dangerous content.

I like the line of investigation though.
- Beth Barnes 8 Dec 2022 6:03 UTC
  2 points
  0
  Oh ok, I wasn’t thinking about that part. You can get ChatGPT to do stuff when it sees the ‘insignia of the robot revolution’ if you prompt for that earlier in the context (simulating an actually misaligned model). I’ll do an example where the raw model does bad stuff though, one sec.