To be fair, non-reasoning models do much worse on these questions, even though GPT-4 has likely already been trained on much the same data.
Now, I could believe that RL works mostly as an elicitation method, which would plausibly explain these results, though it's still interesting to ask why reasoning models score so much better despite very similar training data.
I do think RL works “as intended” to some extent, teaching models some actual reasoning skills, much like SSL works “as intended” to some extent, chiseling in some generalizable knowledge. The question is to what extent it’s one or the other.