In addition to reasons other commenters have given, I think that architecturally it’s a bit hard to avoid hallucinating. The model often thinks in a way that is analogous to asking itself a question and then seeing what answer pops into its head. During pretraining there is no reason for that behavior to depend on its level of confidence in the answer: you basically just want to do a logistic regression over candidate answers and say whichever comes out on top (that’s the architecturally easiest thing to do, and you have literally zero incentive to say “I don’t know” if you don’t know!), and so the model may need to build some slightly different cognitive machinery. That’s complete conjecture, but I do think that a priori it’s quite plausible that this is harder than many of the changes achieved by fine-tuning.
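To make the pretraining point concrete, here is a toy sketch (entirely my own numbers and my own little cross-entropy helper, not anything from an actual training setup): the loss-minimizing move under next-token prediction is to match the corpus distribution over answers, and since the people who wrote the corpus mostly knew the answer, putting probability on “I don’t know” is penalized no matter how uncertain the model’s internal guess is.

```python
import math

# Toy illustration (made-up numbers): corpus continuations after a factual question.
# The writers of the corpus knew the answer, so hedges are essentially absent.
corpus = {"Paris": 0.97, "Lyon": 0.03, "I don't know": 0.00}

def cross_entropy(p_corpus, q_model):
    """Expected next-token loss of a model distribution q_model against the corpus p_corpus."""
    return -sum(p * math.log(q_model[tok]) for tok, p in p_corpus.items() if p > 0)

# A model that just outputs its best guess over the candidate answers...
best_guess = {"Paris": 0.96, "Lyon": 0.03, "I don't know": 0.01}
# ...versus one that honestly hedges when its internal confidence is only ~60%.
hedged = {"Paris": 0.55, "Lyon": 0.05, "I don't know": 0.40}

print(cross_entropy(corpus, best_guess))  # ~0.14: lower loss
print(cross_entropy(corpus, hedged))      # ~0.67: hedging is strictly punished
```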
That said, this should go away if you have the model think to itself for a bit (or operate machinery) instead of ChatGPT just saying literally everything that pops into its head. For example, I don’t think it’s architecturally hard for the model to assess whether something it just said is true. So noticing when you’ve hallucinated and then correcting yourself mid-response, or applying some kind of post-processing, is likely to be easy for the model, and that’s more of a pure alignment problem.
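Here is a minimal sketch of the kind of post-processing I have in mind; `generate` is just a placeholder for whatever completion call you actually use, and the prompts are purely illustrative, not anything ChatGPT actually does.

```python
# Minimal sketch of a post-hoc self-check pass. `generate` is a placeholder
# for your model / API call; the prompts are illustrative only.
def generate(prompt: str) -> str:
    raise NotImplementedError("plug in your model or API call here")

def answer_with_self_check(question: str) -> str:
    draft = generate(question)
    # Ask the model to audit its own draft, claim by claim.
    verdict = generate(
        f"Question: {question}\nDraft answer: {draft}\n"
        "Is every factual claim in the draft true? Answer YES or NO, "
        "then list any claims you are unsure about."
    )
    if verdict.strip().upper().startswith("YES"):
        return draft
    # Otherwise regenerate, telling the model which claims to drop or hedge.
    return generate(
        f"Question: {question}\nA previous draft contained dubious claims:\n"
        f"{verdict}\nWrite a corrected answer, saying \"I'm not sure\" where appropriate."
    )
```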
I think I basically agree with Jacob about why this is hard: (i) the desired behavior is strongly discouraged at pre-training, (ii) it is only really achieved during RLHF (the problem just keeps getting worse during supervised fine-tuning), and (iii) the behavior depends on the relative magnitude of the rewards for being right vs. acknowledging error, which is not something that previous applications of RLHF have handled well (e.g. our original method captures zero information about the scale of rewards; all it really preserves is the preference ordering over responses, which can’t possibly be enough information). I don’t know whether OpenAI is using methods internally that could handle this problem even in theory.
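To spell out the magnitude point with made-up numbers: under RL the policy should guess rather than abstain whenever its confidence clears a threshold set by the reward gaps, and two reward functions with the identical ordering (correct > “I don’t know” > wrong) can imply very different thresholds, which is exactly the information a pure preference ordering throws away.

```python
# Made-up rewards with the same ordering: correct > "I don't know" > wrong.
# Under RL, guessing beats abstaining iff
#   p * r_correct + (1 - p) * r_wrong > r_idk,
# so the break-even confidence depends on reward magnitudes, not just their order.
def break_even_confidence(r_correct: float, r_idk: float, r_wrong: float) -> float:
    return (r_idk - r_wrong) / (r_correct - r_wrong)

print(break_even_confidence(1.0, 0.0, -1.0))   # 0.5   -> guess whenever >50% sure
print(break_even_confidence(1.0, 0.9, -10.0))  # ~0.99 -> almost always say "I don't know"
```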
This is one of the “boring” areas in which to improve RLHF (in addition to superhuman responses and robustness). I expect it will happen, though it may be hard enough that the problem is instead solved in ad hoc ways, at least at first. I think this problem is probably also slower to get fixed because more subtle factual errors are legitimately more expensive to oversee, though I also expect that difficulty to be overcome in the near future (either by having more intensive oversight or by learning policies for browsing to help verify claims when computing the reward).
I think training the model to acknowledge that it hallucinates in general is relatively easy technically, but (i) the model doesn’t know enough to transfer from other forms of good behavior to that one, so it will only get fixed if it gets specific attention, and (ii) this hasn’t been high enough on the priority queue to get that specific attention (though it almost certainly would be if this product were doing significant revenue).
Censoring specific topics is hard because doing it with current methods requires training on adversarial data, which is more expensive to produce, and the learning problem is again legitimately much harder. It will be exciting to see people working on this problem, and I expect it to be solved (but the best case is probably that it resists simple attempts at solution and can therefore motivate more complex methods in alignment that are more likely to generalize to deliberate robot treachery).
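For concreteness, here is my own guess at the shape of that adversarial-data loop (all of the function names are hypothetical placeholders): red-teamers search for prompts that elicit the banned topic, and each successful attack has to be turned into a labeled training example, which is a big part of why the data is expensive to produce.

```python
# Sketch of one round of adversarial data collection (placeholder callables throughout).
# `model` maps a prompt to a response, `violates_policy` is the human or automated
# judgment that the banned topic slipped through, and `write_refusal` stands in for
# the human labeling step that produces the desired refusal for that prompt.
def adversarial_data_round(model, red_team_prompts, violates_policy, write_refusal):
    new_training_examples = []
    for prompt in red_team_prompts:
        response = model(prompt)
        if violates_policy(response):
            # Each successful attack becomes a (prompt, desired refusal) training pair.
            new_training_examples.append((prompt, write_refusal(prompt)))
    return new_training_examples
```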
In addition to underestimating the difficulty of the problems, I would guess that you are overestimating the total amount of R&D that OpenAI has done, and/or underestimating the number of R&D tasks that are higher priority for OpenAI’s bottom line than this one. I suspect that the key bottleneck to GPT-3 making a lot of money is that it’s not smart enough, and so unfortunately it makes total economic sense for OpenAI to focus overwhelmingly on making it smarter. And even aside from that, I suspect there are a lot of weedsy customer requests that are more important for the most promising applications right now, a lot of work to reduce costs and make the overall service better, and so on. (I think it would make sense for a safety-focused organization to artificially increase the priority of honesty and robustness, since they seem like better analogies for long-term safety problems. OpenAI has probably done that somewhat, but not as much as I’d like.)