Open-sourcing scaffolding or sharing techniques is supererogatory.
I would think sharing the scaffolding would be important. Stronger scaffolding could skew evaluation results. From the complete paragraph you seem to suggest that sufficient information about the scaffolding should be published, so I’m curious what you mean here.
Stronger scaffolding could skew evaluation results.
Stronger scaffolding makes evals better.
I think labs should at least demonstrate that their scaffolding is no worse than some baseline. If there’s no established baseline scaffolding for the eval, they can say "we did XYZ and got n%; our secret scaffolding does better." When there is an established baseline scaffolding, they can compare to that (e.g. the best scaffold for SWE-bench Verified is called Agentless; in the o1 system card, OpenAI reported results from running its models in this scaffold).