I don’t think you’re doing anything different from what OpenAI is doing. The Eliezer prompt might be slightly better at eliciting model capabilities than whatever fine-tuning they did, but, as other people have pointed out, it’s also way more conservative and probably hurts performance overall.
(If the point is to prevent the AI from outputting anything misaligned, being conservative is probably the point, and lowering performance seems more than acceptable.)
Yes, but OpenAI could have just done that by adjusting their classification threshold.
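For concreteness, here's a toy sketch of what "adjusting the classification threshold" would mean: a moderation classifier scores each output, and blocking happens when the score crosses a threshold, so lowering that threshold makes the same filter more conservative without changing the model at all. (All names and numbers here are hypothetical, not OpenAI's actual system.)

```python
def flag(scores, threshold):
    """Return which outputs a hypothetical misalignment classifier would block."""
    return [s >= threshold for s in scores]

# Made-up "unsafe" probabilities for four candidate outputs.
scores = [0.10, 0.45, 0.70, 0.95]

print(flag(scores, 0.9))  # permissive: [False, False, False, True]
print(flag(scores, 0.4))  # conservative: [False, True, True, True]
```

The same classifier blocks more outputs at the lower threshold, trading false positives (and hence performance) for fewer misaligned outputs slipping through.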
Isn’t that only the case if their filter was the same but weaker?