Did you test Claude to see whether it is less susceptible to this issue? Otherwise I'm not sure where your comment comes from. When I tested this, I saw similar or worse behavior from that model, although GPT-4 definitely has this issue as well:
https://twitter.com/mobav0/status/1637349100772372480?s=20
Oh interesting. I couldn't get any such rule-breaking completions out of Claude, but testing the prompts on Claude was a bit of an afterthought. Thanks for this! I'll probably update the post after some more testing.
My comment was mostly based on the CAI paper, where they compared the new method against their earlier RLHF model and reported more robustness against jailbreaking. OpenAI's GPT-4 (though not Microsoft's Bing version) also seems to be a lot more robust than GPT-3.5, but I don't know why.