The potential for o1-like long-horizon reasoning post-training compromises capability evals for open weights models. If such post-training isn't applied before or during evals, the evals will significantly underestimate the capabilities of systems that can later be built on top of the open weights, once the recipe is reproduced.
This will likely be relevant for Llama 4 (as the first open weights model trained with 1e26+ FLOPs), if they don't manage a good reproduction of o1-like post-training before release and continue the policy of publishing open weights whenever the evals aren't too alarming.