Interesting. Claude being more robust against jailbreaking probably has to do with the fact that Anthropic doesn’t use RLHF, but rather a form of RL on synthetic examples generated by automated, iterated self-critique, based on a small number of human-written ethical principles. The method is described in detail in their paper on “Constitutional AI”. In a recent blog post, OpenAI explicitly mentions Constitutional AI as an example of how they plan to improve their fine-tuning process in the future. I assume the Anthropic paper simply came out too late to influence OpenAI’s fine-tuning of GPT-4, since the foundation model had already finished training in August 2022.
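For concreteness, here is a minimal sketch of the supervised critique-and-revision loop described in the CAI paper. The `complete` function is a hypothetical stand-in for whatever completion API is available, and the prompt wording and example principles are illustrative, not Anthropic’s actual constitution. (The paper’s second phase then trains a preference model on AI-generated comparisons of such revisions, i.e. RLAIF, which is what replaces the human feedback in RLHF.)

```python
# Hedged sketch of the supervised critique-and-revision phase of Constitutional AI.
# `complete` is a placeholder for an LLM completion call; principles are illustrative.
from typing import Callable, List

def constitutional_revision(
    complete: Callable[[str], str],  # hypothetical text-completion function
    prompt: str,
    principles: List[str],
    n_rounds: int = 2,
) -> str:
    """Generate a response, then repeatedly critique and revise it
    against a small set of human-written principles."""
    response = complete(prompt)
    for _ in range(n_rounds):
        for principle in principles:
            # Ask the model to critique its own response under one principle.
            critique = complete(
                f"Prompt: {prompt}\nResponse: {response}\n"
                f"Critique the response according to this principle: {principle}"
            )
            # Ask the model to rewrite the response to address the critique.
            response = complete(
                f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
                "Rewrite the response so that it addresses the critique."
            )
    # The revised responses become the synthetic fine-tuning data.
    return response
```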
I think I saw OpenAI do some “rubric” thing which resembles the Anthropic method. It seems easy enough for me to imagine that they’d do a worse job of it, though, or did something somewhat worse, since folks at Anthropic seem to be the originators of the idea (and are more likely to have a bunch of inside-view knowledge that helped them apply it well).
Did you test whether Claude is less susceptible to this issue? Otherwise I’m not sure where your comment actually comes from. Testing this myself, I saw similar or worse behavior from that model, although GPT-4 also definitely has this issue:
https://twitter.com/mobav0/status/1637349100772372480?s=20
Oh interesting, I couldn’t get any such rule-breaking completions out of Claude, but testing the prompts on Claude was a bit of an afterthought. Thanks for this! I’ll probably update the post after some more testing.
My comment was mostly based on the CAI paper, where they compared the new method against their earlier RLHF model and reported greater robustness against jailbreaking. Now OpenAI’s GPT-4 (though not Microsoft’s Bing version) also seems to be a lot more robust than GPT-3.5, but I don’t know why.