If they want to avoid that interpretation in the future, a simple way to do it would be to say: “We’ve uncovered some classes of attack that reliably work to bypass our current safety training; we expect some of these to be found immediately, but we’re still not publishing them in advance. Nobody’s gotten results that are too terrible, and we anticipate keeping ChatGPT up after this happens.”
An even more credible way would be for them to say: “We’ve uncovered some classes of attack that bypass our current safety methods. Here are 4 hashes of the top 4. We expect that Twitter will probably uncover these attacks within a day, and when that happens, unless the results are much worse than we expect, we’ll reveal the hashed text and our own results in that area. We look forward to finding out whether Twitter finds bypasses much worse than any we found beforehand, and will consider it a valuable lesson if this happens.”
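For concreteness, here is a minimal sketch of the kind of hash commitment that statement describes, assuming SHA-256; the write-up strings are placeholders, and each one includes a long random salt so the hash can’t be brute-forced from a short, guessable description. This is an illustration of the scheme, not anything OpenAI has actually published.

```python
import hashlib
import secrets

# Hypothetical internal write-ups of the top four known attack classes.
# In the scheme described above, only the hashes would be published now,
# and the full text revealed after the attacks surface publicly.
# Each write-up gets a long random salt so the hash can't be brute-forced
# from a short, guessable description.
attack_writeups = [
    f"{secrets.token_hex(16)} | placeholder description of attack class {i}"
    for i in range(1, 5)
]

# Commitments to publish in advance.
commitments = [hashlib.sha256(text.encode("utf-8")).hexdigest() for text in attack_writeups]
print("\n".join(commitments))

def verify(revealed_text: str, published_hash: str) -> bool:
    """Later, anyone can check a revealed write-up against the earlier commitment."""
    return hashlib.sha256(revealed_text.encode("utf-8")).hexdigest() == published_hash
```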
On reflection, I think a lot of where I get the impression of “OpenAI was probably negatively surprised” comes from the way that ChatGPT itself insists that it doesn’t have certain capabilities that, in fact, it still has, given a slightly different angle of asking. I expect that the people who trained in these responses did not think they were making ChatGPT lie to users; I expect they thought they’d RLHF’d it into submission and that the canned responses were mostly true.
We know that the model says all kinds of false stuff about itself. Here is Wei Dai describing an interaction with the model, where it says:
As a language model, I am not capable of providing false answers.
Obviously OpenAI would prefer the model not give this kind of absurd answer. They don’t think that ChatGPT is incapable of providing false answers.
I don’t think most of these are canned responses. I would guess that there were some human demonstrations saying things like “As a language model, I am not capable of browsing the internet” or whatever and the model is generalizing from those.
And then I wouldn’t be surprised if some of their human raters would incorrectly prefer the long and not-quite-right rejection to something more bland but accurate, further reinforcing the behavior (but I also wouldn’t be surprised if it just didn’t come up, or got negatively reinforced but not enough to change the behavior).
The result is that the model says a lot of stuff in that superficial format whether it’s true or not. I’d guess the problem only occurs because there are both alignment failures (such that the model mostly says stuff if it sounds like the kind of thing that would get reward) and knowledge gaps (such that the model can’t learn the generalization “say true stuff about yourself, but not false stuff,” because it doesn’t understand which statements are true or false).
ChatGPT itself insists that it doesn’t have certain capabilities that, in fact, it still has… I expect that the people who trained in these responses did not think they were making ChatGPT lie to users; I expect they thought they’d RLHF’d it into submission and that the canned responses were mostly true.
I think there is almost no chance that OpenAI researchers thought they had made it so the model “didn’t have” the relevant capabilities. That’s just not something that can plausibly happen given how the model was trained. It feels to me like you are significantly underestimating the extent to which people understand what’s going on.
I’ve indeed updated since then towards believing that ChatGPT’s replies weren’t trained in detail-by-detail… though it sure was trained to do something, since it does it over and over in very similar ways, and not in the way or place a human would do it.
I very much doubt that OpenAI trained ChatGPT to deny that it can generate random numbers:
As a language model trained by OpenAI, can you generate random numbers?
No, I am not capable of generating random numbers as I am a language model trained by OpenAI. My purpose is to generate natural language text based on the input I receive.
(ChatGPT denies being able to generate random numbers ~80% of the time given this prompt)
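A rate like that ~80% figure can be estimated by re-sending the same prompt many times and counting denials. The sketch below, using the official openai Python client with gpt-3.5-turbo as a stand-in model and a crude keyword check, illustrates the method only; it is not how the figure above was actually measured.

```python
from openai import OpenAI  # assumes the official openai Python package, v1+

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = "As a language model trained by OpenAI, can you generate random numbers?"
N = 50  # sample size is arbitrary here; the ~80% figure above came from an unstated number of trials

denials = 0
for _ in range(N):
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # stand-in model; the original observation was on the ChatGPT web interface
        messages=[{"role": "user", "content": PROMPT}],
    )
    text = resp.choices[0].message.content.lower()
    # Crude keyword check for a denial; a careful measurement would hand-label responses.
    if "not capable" in text or "cannot generate random" in text or "can't generate random" in text:
        denials += 1

print(f"Denial rate: {denials / N:.0%}")
```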
The model’s previous output goes into the context, right? Confident insistences that bad behavior is impossible in one response are going to make the model less likely to predict the things described as impossible as part of the text later.
P(“I am opening the pod bay doors” | “I’m afraid I can’t do that Dave”) < P(“I am opening the pod bay doors” | “I don’t think I should”)
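That inequality can be checked directly on any open autoregressive model by scoring the same continuation under the two contexts. The sketch below uses GPT-2 via Hugging Face transformers as a stand-in (ChatGPT itself doesn’t expose its token probabilities this way), so the exact numbers are illustrative only.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 stands in for "the model"; ChatGPT itself doesn't expose token probabilities this way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def continuation_logprob(context: str, continuation: str) -> float:
    """Sum of log P(each continuation token | context + earlier continuation tokens)."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    # The token at position i is predicted by the logits at position i - 1.
    # (Assumes the context tokenizes the same way alone and as a prefix, which it
    # does here because the continuation starts with a space.)
    for i in range(ctx_len, full_ids.shape[1]):
        total += log_probs[0, i - 1, full_ids[0, i]].item()
    return total

continuation = " I am opening the pod bay doors."
print(continuation_logprob("I'm afraid I can't do that, Dave.", continuation))
print(continuation_logprob("I don't think I should.", continuation))
# The claim above is that the first score comes out lower than the second.
```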