The chess example is meant to make specific points about RL*F concealing a capability that remains (or is even amplified); I’m not trying to claim that the “put up a good fight but lose” criterion is analogous to current RL*F criteria. (Though it does rhyme qualitatively with “be helpful and harmless”.)
I agree that “helpful-only” RL*F would result in a model that scores higher on capabilities evals than the base model, possibly much higher. I’m frankly a bit worried about even training that model.