The first reasoning trace in the QwQ blog post is impressive in how it eventually stumbles onto the correct answer even though the 32B model clearly has no clue throughout, effectively exploring while almost blind. If this is sufficient to reach o1-preview level results on reasoning benchmarks, it’s plausible that RL in such post-training is mostly unhobbling the base models rather than making them smarter.
So some of these recipes might have no way of scaling far, in the same sense that preference tuning doesn’t scale far (unlike AlphaZero). The QwQ post doesn’t include a scaling plot, and the scaling plot in the DeepSeek-R1 post shows improvement only from thinking for more tokens, not from further training. The o1 post does show improvement with more training, but it might plateau in the uninteresting way instruction/preference post-training plateaus, by making the model reliably do the thing its base model is in some sense already capable of. The similarity between o1, R1, and QwQ is superficial enough that the potential to scale with more post-training might be present in some of them and not others, or in none of them.