The chess example is meant to make specific points about RL*F concealing a capability that remains (or is even amplified); I’m not trying to claim that the “put up a good fight but lose” criterion is analogous to current RL*F criteria. (Though it does rhyme qualitatively with “be helpful and harmless”.)
I agree that “helpful-only” RL*F would result in a model that scores higher on capabilities evals than the base model, possibly much higher. I’m frankly a bit worried about even training that model.
Go all-in on lobbying the US and other governments to fully prohibit the training of frontier models beyond a certain level, in a way that OpenAI can’t route around (so probably block Altman’s foreign chip factory initiative, for instance).