Competitors receive a higher score for submitting their solutions faster. Because models can think in parallel and attempt all problems simultaneously, they have an innate advantage over humans. We elected to reduce this advantage in our primary results by estimating o3's score for each solved problem as the median of the scores of the human participants who solved that problem in the contest with the same number of failed attempts.
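If I'm reading that right, the score estimate is just a filtered median over the human results. A minimal sketch of my understanding; the data layout and names here are my own, not from the paper:

```python
from statistics import median

def estimate_model_score(problem_id, model_failed_attempts, human_results):
    """Estimate the model's score on a solved problem as the median score of
    human contestants who solved the same problem with the same number of
    failed attempts. `human_results` maps problem_id -> list of
    (failed_attempts, score) tuples (hypothetical structure)."""
    matching = [
        score
        for failed, score in human_results[problem_id]
        if failed == model_failed_attempts
    ]
    return median(matching) if matching else None

# e.g. human_results = {"A": [(0, 95.0), (1, 80.0), (1, 76.0)]}
# estimate_model_score("A", 1, human_results) -> 78.0
```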
We could instead use the model's real thinking time to compute ratings. o3 uses a learned scoring function for test-time ranking in addition to a chain of thought. This process is perfectly parallel, so true model submission times depend on the number of GPUs available during the contest. On a very large cluster, the time taken to pick the top-ranked solutions is (very slightly more than) the maximum over the thinking times of the candidate submissions. Using this maximum-parallelism assumption and the sequential o3 sampling speed would result in a higher estimated rating than presented here. We note that because sequential test-time compute has grown rapidly since early language models, it was not guaranteed that models would solve problems quickly compared to humans, but in practice o3 does.
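So the timing claim reduces to a max over candidates, plus a small ranking cost. A toy sketch under the stated maximum-parallelism assumption (every candidate sampled on its own GPU; the function and parameter names are mine):

```python
def parallel_submission_time(candidate_thinking_times, ranking_overhead=0.0):
    """Under maximum parallelism, each candidate solution is generated
    concurrently, so the submission time is (slightly more than) the slowest
    candidate's thinking time. `ranking_overhead` stands in for the small
    extra cost of picking the top-ranked solutions (hypothetical parameter)."""
    return max(candidate_thinking_times) + ranking_overhead

# e.g. parallel_submission_time([120.0, 300.0, 45.0]) -> 300.0 seconds
```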
See appendix B.3 in particular:
Huh, I tried to paste that excerpt as an image into my comment, but it disappeared. Thanks.