CodeForces ratings are determined by your performance in competitions, and your score in a competition is determined, in part, by how quickly you solve the problems. I'd expect o3 to be much faster than human contestants. (The specifics are unclear; I'm not sure how large test-time compute usage translates to wall-clock time, but at the very least o3 parallelizes across problems.)
This somewhat inflates o3's results relative to humans, so one shouldn't conclude that o3 is in the top 200 in terms of algorithmic problem-solving skill.
As in, for the literal task of "solve this CodeForces problem in 30 minutes" (or whatever the competition allows), o3 is roughly top 200 among people who do CodeForces (supposing o3 didn't cheat on wall-clock time). However, if you gave humans 8 serial hours and o3 8 serial hours, far more than 200 humans would do better. (Or maybe the crossover is at 64 serial hours instead of 8.)
Is this what you mean?
This is close but not quite what I mean. Another attempt:
The literal Do Well At CodeForces task takes the form "you are given ~2 hours and ~6 problems; maximize a score function that takes into account which problems you solved and the times at which you solved them". At this task, o3 is in the top 200 (conditional on no cheating), so I agree there.
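To make the score function concrete, here is a sketch of CodeForces-style round scoring. The constants (linear decay of max_points/250 per minute, a 50-point penalty per wrong submission, a floor at 30% of the maximum) are the commonly used ones, but exact values vary by round format, so treat this as an approximation rather than the official rule:

```python
def problem_score(max_points: int, minutes: int, wrong_tries: int) -> float:
    """Approximate CodeForces round scoring for one problem.

    The problem's value decays linearly with the minute at which it is
    solved, each wrong submission costs 50 points, and the result is
    floored at 30% of the problem's maximum value.
    """
    score = max_points - (max_points / 250) * minutes - 50 * wrong_tries
    return max(score, 0.3 * max_points)

# Solving a 1000-point problem at minute 10 vs. minute 60:
fast = problem_score(1000, 10, 0)  # 960.0
slow = problem_score(1000, 60, 0)  # 760.0
```

The linear decay is why solve speed matters so much: the same problem is worth hundreds of points more to a fast solver.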
As you suggest, a more natural task would be "you are given t time and one problem; maximize your probability of solving it in the given time". Already at t equal to ~1 hour (which is what contestants typically spend on the hardest problem they'll solve), I'd expect o3 to be noticeably worse than the top 200. This is because the CodeForces scoring function heavily penalizes slowness, so if o3 and a human have equal contest performance, the human has to make up for their slowness by solving more problems. (Again, this is assuming that o3 is faster than humans in wall-clock time.)
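To see concretely why a slower solver must compensate with extra problems, here is a toy comparison under the usual linear-decay rule (max_points/250 lost per minute, floored at 30%). All solve times are made-up numbers for illustration, not real contest data:

```python
def score(p: int, minutes: int) -> float:
    # Linear decay of p/250 points per minute, floored at 30% of p.
    return max(p - (p / 250) * minutes, 0.3 * p)

# A near-instant solver and a slower solver, same three problems:
fast_three = score(500, 2) + score(1000, 5) + score(1500, 10)   # 2916.0
slow_three = score(500, 10) + score(1000, 30) + score(1500, 70)  # 2440.0
```

With identical problems solved, the slower contestant trails by hundreds of points and needs an additional solve to reach the same total, which is why equal contest scores can mask a gap in per-problem solve probability.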
I separately believe that humans would scale better than AIs w.r.t. t, but that is not the point I’m making here.