orthonormal comments on AI #57: All the AI News That’s Fit to Print

orthonormal 3 Apr 2024 23:16 UTC
4 points
0
Oh wait, I misinterpreted you as using “much worse” to mean “much scarier”, when instead you mean “much less capable”.
I’d be glad if it were the case that RL*F doesn’t hide any meaningful capabilities existing in the base model, but I’m not sure it is the case, and I’d sure like someone to check! It sure seems like RL*F is likely in some cases to get the model to stop explicitly talking about a capability it has (unless it is jailbroken on that subject), rather than to remove the capability.
(Imagine RL*Fing a base model to stop explicitly talking about arithmetic; are we sure it would un-learn the rules?)
- Rohin Shah 4 Apr 2024 6:48 UTC
  2 points
  0
  Oh yes, sorry for the confusion, I did mean “much less capable”.
  Certainly RLHF can get the model to stop talking about a capability, but usually this is extremely obvious because the model gives you an explicit refusal? Certainly if we encountered that we would figure out some way to make that not happen any more.
  - orthonormal 4 Apr 2024 18:18 UTC
    4 points
    0
    “Certainly RLHF can get the model to stop talking about a capability, but usually this is extremely obvious because the model gives you an explicit refusal?”
    How certain are you that this is always true (rather than “we’ve usually noticed this even though we haven’t explicitly been checking for it in general”), and that it will continue to be so as models become stronger?
    It seems to me like additionally running evals on base models is a highly reasonable precaution.
    - ryan_greenblatt 4 Apr 2024 19:20 UTC
      11 points
      0
      I responded to this conversation in this comment on your corresponding post.
    - Rohin Shah 5 Apr 2024 8:50 UTC
      4 points
      0
      “How certain are you that this is always true”
      My probability that (EDIT: for the model we evaluated) the base model outperforms the finetuned model (as I understand that statement) is so small that it falls in the range of probabilities I am unsure how to reason about (i.e., model error clearly dominates). Intuitively (excluding things like model error), even 1 in a million feels like it could be too high.
      My probability that the model sometimes stops talking about some capability without giving you an explicit refusal is much higher (depending on how you operationalize it, I might be effectively certain that this is true, i.e., >99%), but this is not fixed by running evals on base models.
      
      (Obviously there’s a much, much higher probability that I’m somehow misunderstanding what you mean. E.g., maybe you’re imagining some effort to elicit capabilities with the base model (and for some reason you’re not worried about the same failure mode there), maybe you allow for SFT but not RLHF, maybe you mean just avoiding the safety tuning, etc.)