Oh wait, I misinterpreted you as using “much worse” to mean “much scarier”, when instead you mean “much less capable”.
I’d be glad if it were the case that RL*F doesn’t hide any meaningful capabilities existing in the base model, but I’m not sure it is the case, and I’d sure like someone to check! It sure seems like RL*F is likely in some cases to get the model to stop explicitly talking about a capability it has (unless it is jailbroken on that subject), rather than to remove the capability.
(Imagine RL*Fing a base model to stop explicitly talking about arithmetic; are we sure it would un-learn the rules?)
Oh yes, sorry for the confusion, I did mean “much less capable”.
Certainly RLHF can get the model to stop talking about a capability, but usually this is extremely obvious because the model gives you an explicit refusal? Certainly if we encountered that we would figure out some way to make that not happen any more.
Certainly RLHF can get the model to stop talking about a capability, but usually this is extremely obvious because the model gives you an explicit refusal?
How certain are you that this is always true (rather than “we’ve usually noticed this even though we haven’t explicitly been checking for it in general”), and that it will continue to be so as models become stronger?
It seems to me like additionally running evals on base models is a highly reasonable precaution.
My probability that (EDIT: for the model we evaluated) the base model outperforms the finetuned model (as I understand that statement) is so small that it is within the realm of probabilities that I am confused about how to reason about (i.e. model error clearly dominates). Intuitively (excluding things like model error), even 1 in a million feels like it could be too high.
My probability that the model sometimes stops talking about some capability without giving you an explicit refusal is much higher (depending on how you operationalize it, I might be effectively-certain that this is true, i.e. >99%) but this is not fixed by running evals on base models.
(Obviously there’s a much, much higher probability that I’m somehow misunderstanding what you mean. E.g. maybe you’re imagining some effort to elicit capabilities with the base model (and for some reason you’re not worried about the same failure mode there), maybe you allow for SFT but not RLHF, maybe you mean just avoid the safety tuning, etc.)
I responded to this conversation in this comment on your corresponding post.