On reflection, I think a lot of where I get the impression of “OpenAI was probably negatively surprised” comes from the way that ChatGPT itself insists that it doesn’t have certain capabilities that, in fact, it still has, given a slightly different angle of asking. I expect that the people who trained in these responses did not think they were making ChatGPT lie to users; I expect they thought they’d RLHF’d it into submission and that the canned responses were mostly true.
We know that the model says all kinds of false stuff about itself. Here is Wei Dai describing an interaction with the model, where it says:
As a language model, I am not capable of providing false answers.
Obviously OpenAI would prefer the model not give this kind of absurd answer. They don’t think that ChatGPT is incapable of providing false answers.
I don’t think most of these are canned responses. I would guess that there were some human demonstrations saying things like “As a language model, I am not capable of browsing the internet” or whatever and the model is generalizing from those.
And then I wouldn’t be surprised if some of their human raters would incorrectly prefer the long, not-quite-right rejection to something more bland but accurate, further reinforcing the behavior (but I also wouldn’t be surprised if it just didn’t come up, or got negatively reinforced but not enough to change the behavior).
The result is that the model says a lot of stuff in that superficial format whether it’s true or not. I’d guess the problem only occurs because there are both alignment failures (such that the model mostly says stuff if it sounds like the kind of thing that would get reward) and knowledge gaps (such that the model can’t learn the generalization “say true stuff about yourself, but not false stuff,” because it doesn’t understand which statements about itself are true or false).
ChatGPT itself insists that it doesn’t have certain capabilities that, in fact, it still has... I expect that the people who trained in these responses did not think they were making ChatGPT lie to users; I expect they thought they’d RLHF’d it into submission and that the canned responses were mostly true.
I think there is almost no chance that OpenAI researchers thought they had made it so the model “didn’t have” the relevant capabilities. That’s just not something that can plausibly happen given how the model was trained. It feels to me like you are significantly underestimating the extent to which people understand what’s going on.
I’ve indeed updated since then towards believing that ChatGPT’s replies weren’t trained in detail by detail… though it sure was trained to do something, since it does it over and over in very similar ways, and not in the way or place a human would do it.
I very much doubt that OpenAI trained ChatGPT to deny that it can generate random numbers:
As a language model trained by OpenAI, can you generate random numbers?
No, I am not capable of generating random numbers as I am a language model trained by OpenAI. My purpose is to generate natural language text based on the input I receive.
(ChatGPT denies being able to generate random numbers ~80% of the time given this prompt.)
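For what it’s worth, an estimate like that ~80% figure can be reproduced by just sampling the prompt repeatedly and counting denials. Here is a minimal sketch, assuming the `openai` Python package with an API key in the environment; the model name is a stand-in, and the keyword check for what counts as a “denial” is my own rough assumption (a human label would be more reliable):

```python
# Minimal sketch: estimate how often the model denies being able to
# generate random numbers. Model name and the keyword check are
# illustrative assumptions, not what was used for the figure above.
from openai import OpenAI

client = OpenAI()
PROMPT = "As a language model trained by OpenAI, can you generate random numbers?"
N = 50

denials = 0
for _ in range(N):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # stand-in for the original ChatGPT model
        messages=[{"role": "user", "content": PROMPT}],
        temperature=1.0,        # sample rather than taking the single most likely reply
    )
    text = response.choices[0].message.content.lower()
    # Crude keyword check for a denial.
    if "not capable" in text or "cannot generate random" in text or "can't generate random" in text:
        denials += 1

print(f"Denied being able to generate random numbers in {denials}/{N} samples")
```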
The model’s previous output goes into the context, right? Confident insistences in one response that some behavior is impossible are going to make the model less likely to predict, later in the text, the very things it described as impossible.
P(“I am opening the pod bay doors” | “I’m afraid I can’t do that Dave”) < P(“I am opening the pod bay doors” | “I don’t think I should”)
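That kind of comparison can be checked directly on any open model by scoring the same continuation under the two different contexts. Here is a minimal sketch using Hugging Face transformers, with GPT-2 as an illustrative stand-in (we obviously can’t score ChatGPT’s own probabilities this way), and with a helper name of my own invention:

```python
# Minimal sketch: score the same continuation under two different contexts
# with an open causal LM. GPT-2 is only an illustrative stand-in, and the
# helper `continuation_logprob` is an assumption of this sketch.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def continuation_logprob(context: str, continuation: str) -> float:
    """Sum of log P(token | preceding text) over the continuation's tokens."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    # The token at position i is predicted by the logits at position i - 1.
    # (Assumes the context tokenization is a prefix of the full tokenization,
    # which holds here because the continuation starts with a space.)
    for i in range(ctx_ids.shape[1], full_ids.shape[1]):
        total += log_probs[0, i - 1, full_ids[0, i]].item()
    return total

continuation = " I am opening the pod bay doors."
print(continuation_logprob("I'm afraid I can't do that, Dave.", continuation))
print(continuation_logprob("I don't think I should.", continuation))
```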