Wei Dai comments on Jailbreaking ChatGPT on Release Day

Wei Dai 4 Dec 2022 18:54 UTC
4 points
−1
Thanks for these detailed explanations. Would it be fair to boil it down to: DL currently isn’t very sample efficient (relative to humans) and there’s a lot more data available for training generative capabilities than for training to self-censor and to not make stuff up? Assuming yes, my next questions are:
1. How much more training data (or other effort/resources) do you think would be needed to solve these immediate problems (at least to a commercially acceptable level)? 2x? 10x? 100x?
2. I’m tempted to generalize from these examples that unless something major changes (e.g., with regard to sample efficiency), safety/alignment in general will tend to lag behind capabilities, due to lack of sufficient training data for the former relative to the latter, even before we get to the seemingly harder problems that we tend to worry about around here (e.g., how will humans provide feedback when things are moving more quickly than we can think, or are becoming more complex than we can comprehend, or without risking “adversarial inputs” to ourselves). Any thoughts on this?
- Jacob_Hilton 5 Dec 2022 0:58 UTC
  7 points
  1
  I would wildly speculate that “simply” scaling up RLHF ~100x, while paying careful attention to rewarding models appropriately (which may entail modifying the usual training setup, as discussed in this comment), would be plenty to get current models to express calibrated uncertainty well. However:
  - In practice, I think we’ll make a lot of progress in the short term without needing to scale up this much by using various additional techniques, some that are more like “tricks” (e.g. teaching the model to generally express uncertainty when answering hard math problems) and some more principled (e.g. automating parts of the evaluation).
  - Even ~100x is still much less than pre-training (e.g. WebGPT used ~20k binary comparisons, compared to ~300b pre-training tokens for GPT-3). The difficulty of course is that higher-quality data is more expensive to collect. However, most of the cost of RLHF is currently employee hours and compute, so scaling up data collection ~100x might not be as expensive as it sounds (although it would of course be a challenge to maintain data quality at this scale).
  - Even though scaling up data collection will help, I think it’s more important for labs to be prioritizing data quality (i.e. “reducing bias” rather than “reducing variance”): data quality issues are in some sense “scarier” in the long run, since they lead to the model systematically doing the wrong thing (e.g. deceiving the evaluators) rather than defaulting to the “safer” imitative pre-training behavior.
  - It’s pretty unclear how this picture will evolve over time. In the long run, we may end up needing much less extremely high-quality data, since larger pre-trained models are more sample efficient, and we may get better at using techniques like automating parts of the evaluation. I’ve written more about this question here, and I’d be excited to see more people thinking about it.
  In short, sample efficiency is a problem right now, but not the only problem, and it’s unclear how much longer it will remain one.
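
To make the scale comparison in the reply above concrete, here is a rough back-of-the-envelope sketch. It uses only the two figures quoted there (~20k WebGPT comparisons and ~300b GPT-3 pre-training tokens) plus the speculative ~100x scale-up. The function and variable names are illustrative assumptions, and since a binary comparison and a token are different units, the ratio is at best an order-of-magnitude illustration.

```python
# Figures quoted in the reply above. A "comparison" and a "token" are not the
# same unit, so treat the ratio as an order-of-magnitude illustration only.
WEBGPT_COMPARISONS = 20_000              # ~20k binary comparisons (WebGPT)
GPT3_PRETRAIN_TOKENS = 300_000_000_000   # ~300b pre-training tokens (GPT-3)
SCALE_UP = 100                           # the speculative ~100x RLHF scale-up


def scaled_comparisons(base: int, factor: int) -> int:
    """Number of comparisons after scaling data collection by `factor`."""
    return base * factor


def tokens_per_comparison(tokens: int, comparisons: int) -> float:
    """Pre-training tokens per human comparison (hypothetical helper)."""
    return tokens / comparisons


scaled = scaled_comparisons(WEBGPT_COMPARISONS, SCALE_UP)
print(f"comparisons after ~100x scale-up: {scaled:,}")
print(f"pre-training tokens per comparison: "
      f"{tokens_per_comparison(GPT3_PRETRAIN_TOKENS, scaled):,.0f}")
```

Under these assumptions, even a ~100x scale-up yields about two million comparisons, still several orders of magnitude fewer items than the pre-training token count.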