Interesting. Claude being more robust against jailbreaking probably has to do with the fact that Anthropic doesn’t use RLHF, but rather a form of RL on synthetic examples generated by automated, iterated self-critique, based on a small number of human-written ethical principles. The method is described in detail in their paper on “Constitutional AI”. In a recent blog post, OpenAI explicitly mentions Constitutional AI as an example of how they plan to improve their fine-tuning process in the future. I assume the Anthropic paper simply came out too late to influence OpenAI’s fine-tuning of GPT-4, since the foundation model had already finished training in August 2022.
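For concreteness, here is a minimal sketch of the supervised critique-and-revision loop described in the CAI paper. The `complete` function is a hypothetical stand-in for whatever completion API is available, and the prompt wording and example principles are illustrative, not Anthropic’s actual constitution. (The paper’s second phase then trains a preference model on AI-generated comparisons of such revisions, i.e. RLAIF, which is what replaces the human feedback in RLHF.)

```python
# Hedged sketch of the supervised critique-and-revision phase of Constitutional AI.
# `complete` is a placeholder for an LLM completion call; principles are illustrative.
from typing import Callable, List

def constitutional_revision(
    complete: Callable[[str], str],  # hypothetical text-completion function
    prompt: str,
    principles: List[str],
    n_rounds: int = 2,
) -> str:
    """Generate a response, then repeatedly critique and revise it
    against a small set of human-written principles."""
    response = complete(prompt)
    for _ in range(n_rounds):
        for principle in principles:
            # Ask the model to critique its own response under one principle.
            critique = complete(
                f"Prompt: {prompt}\nResponse: {response}\n"
                f"Critique the response according to this principle: {principle}"
            )
            # Ask the model to rewrite the response to address the critique.
            response = complete(
                f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
                "Rewrite the response so that it addresses the critique."
            )
    # The revised responses become the synthetic fine-tuning data.
    return response
```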
I think I saw OpenAI do some “rubric” thing which resembles the Anthropic method. It seems easy enough for me to imagine that they’d do a worse job of it, though, or did something somewhat worse, since folks at Anthropic seem to be the originators of the idea (and are more likely to have a bunch of inside-view knowledge that helped them apply it well).
Did you test whether Claude is less susceptible to this issue? Otherwise I’m not sure where your comment actually comes from. Testing this myself, I saw similar or worse behavior from that model, although GPT-4 also definitely has this issue:
https://twitter.com/mobav0/status/1637349100772372480?s=20
Oh interesting, I couldn’t get any such rule-breaking completions out of Claude, but testing the prompts on Claude was a bit of an afterthought. Thanks for this! I’ll probably update the post after some more testing.
My comment was mostly based on the CAI paper, where they compared the new method against their earlier RLHF model and reported greater robustness against jailbreaking. Now OpenAI’s GPT-4 (though not Microsoft’s Bing version) also seems to be a lot more robust than GPT-3.5, but I don’t know why.