This isn’t evidence against OP? If it’s true that RL lowers pass@k performance for sufficiently large k, we’d certainly expect o1 with 10K submissions to be weaker than base/instruct with 10K submissions.
It’s evidence to the extent that the mere fact of publishing Figure 7 suggests (hopefully) that the authors, who likely know the relevant OpenAI internal research, didn’t expect their pass@10K result for the reasoning model to be much worse than the language monkey pass@10K result for the underlying non-reasoning model. So maybe it’s not actually worse.
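For concreteness, here is a toy sketch of the effect the question has in mind: a model that is sharper per sample can still lose at large k if it has zero chance on some problems. All numbers are invented for illustration and `pass_at_k`, `base_probs` and `rl_probs` are hypothetical names; nothing here is measured from o1 or any real model, and whether real reasoning models behave this way is exactly what's in dispute.

```python
def pass_at_k(success_probs, k):
    """Expected pass@k: mean over problems of 1 - (1 - p)^k,
    where p is the per-sample success probability on that problem."""
    return sum(1.0 - (1.0 - p) ** k for p in success_probs) / len(success_probs)


# Hypothetical per-problem success rates (assumptions, not data).
base_probs = [0.05, 0.05, 0.001, 0.001]  # rarely right, but never at zero
rl_probs = [0.6, 0.6, 0.0, 0.0]          # much sharper on some, zero on others

for k in (1, 10, 100, 10_000):
    base = pass_at_k(base_probs, k)
    rl = pass_at_k(rl_probs, k)
    print(f"k={k:>6}  base={base:.3f}  rl={rl:.3f}")
```

In this toy setup the "RL" model wins at small k and the "base" model wins at large k. The open question is whether a real reasoning model has problems at (or near) zero per-sample success that the underlying model does not.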