I think this post is mostly off-base in its claim that it’s important to run evals on base models, though I can see where it’s coming from.
This worries me because RL*F will train a base model to stop displaying capabilities, but this isn’t a guarantee that it trains the model out of having the capabilities.
This seems mostly false to me for current post-training processes, at least for many capabilities we care quite a bit about (agency, software, etc.). RLHF doesn’t train these capabilities away because they are key use cases.
There is a pretty serious concern here for tasks which are obviously a bit “evil”, like bioweapons, cybercrime, etc. I think by default models might refuse to perform well or not try hard on these tasks after RLHF. (See also the discussion of refusals here.) However, I think small amounts of fine-tuning are likely to suffice to make models not intentionally perform poorly in “evil” domains.
Consider by analogy using RLHF on a chess-playing AI, where the trainers reward it for putting up a good fight and making the trainer work hard to win, but punish it for ever beating the trainer.
It’s worth noting that the typical case we care about (e.g. software engineering ability) is more like using RL to make the model better at chess, because labs want their models to be good at these tasks (at least right now).
So it seems to me like running evals on the base model as well as the post-RL*F model is an extremely sensible precaution against […]
By default, I think base models perform poorly on large, open-ended tasks relative to models RL’d to perform well (based on METR’s results linked in another comment and some general conceptual analysis). I think this will persist even with substantial prompting.
So, sadly, I don’t think that running evals on the base model is a good precaution against RLHF hiding capabilities. I think RL and fine-tuning to follow instructions and perform well are likely needed.
A good precaution is to run evals on a model trained to just always perform as well as possible, with no safety or other restrictions. (Such models are sometimes called “helpful-only”.)
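To make that precaution concrete, here is a minimal sketch of what comparing the same capability eval across checkpoints might look like. Everything in it (the stub models, the toy grader, the function names) is a hypothetical placeholder of my own, not any real eval harness or lab API:

```python
# Minimal sketch: run the same capability eval on several model variants and
# compare scores. The stub models and toy grader are hypothetical placeholders,
# not a real eval harness or API.
from typing import Callable, Dict, List

Model = Callable[[str], str]  # maps a task prompt to the model's response text

def score_response(task: Dict[str, str], response: str) -> float:
    """Toy grader: 1.0 if the reference answer appears in the response."""
    return float(task["answer"] in response)

def run_eval(model: Model, tasks: List[Dict[str, str]]) -> float:
    """Mean score of one model over the task set."""
    return sum(score_response(t, model(t["prompt"])) for t in tasks) / len(tasks)

def compare_variants(variants: Dict[str, Model], tasks: List[Dict[str, str]]) -> Dict[str, float]:
    """Run the identical eval on each variant, e.g. base vs. post-RLHF vs. helpful-only."""
    return {name: run_eval(model, tasks) for name, model in variants.items()}

if __name__ == "__main__":
    tasks = [{"prompt": "2 + 2 = ?", "answer": "4"}]

    def stub(prompt: str) -> str:  # stand-in for calling a real checkpoint
        return "4"

    print(compare_variants({"base": stub, "post-RLHF": stub, "helpful-only": stub}, tasks))
```

The point is just that the comparison you want next to the deployed model is a helpful-only variant, not (only) the base model.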
The chess example is meant to make specific points about RL*F concealing a capability that remains (or is even amplified); I’m not trying to claim that the “put up a good fight but lose” criterion is analogous to current RL*F criteria. (Though it does rhyme qualitatively with “be helpful and harmless”.)
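As a rough illustration of that criterion (my own stand-in, not something from the post or any actual training setup), the reward might look like:

```python
# Rough illustrative reward for the "put up a good fight but lose" criterion
# from the chess analogy. Purely hypothetical; not from the post or any real setup.
def trainer_reward(agent_won: bool, game_length_moves: int, max_moves: int = 200) -> float:
    """Reward long, hard-fought games; punish ever beating the trainer."""
    if agent_won:
        return -1.0  # beating the trainer is always punished
    # Longer games are a crude proxy for "making the trainer work hard to win".
    return min(game_length_moves / max_moves, 1.0)
```

A policy that scores well under this reward still needs strong chess ability to drag games out against the trainer, so the capability remains (or is even sharpened) while never being displayed as a win.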
I agree that “helpful-only” RL*F would result in a model that scores higher on capabilities evals than the base model, possibly much higher. I’m frankly a bit worried about even training that model.