This is close but not quite what I mean. Another attempt:
The literal Do Well At CodeForces task takes the form “you are given ~2 hours and ~6 problems, maximize this score function that takes into account the problems you solved and the times at which you solved them”. On this task, o3 is in the top 200 (conditional on no cheating), so I agree there.
As you suggest, a more natural task would be “you are given time t and one problem, maximize your probability of solving it in the given time”. Already at t equal to ~1 hour (which is what contestants typically spend on the hardest problem they’ll solve), I’d expect o3 to be noticeably worse than top 200. This is because the CodeForces scoring function heavily penalizes slowness, so if o3 and a human have equal performance in the contests, the human has to make up for their slowness by solving more problems, which means the human is solving problems that o3 does not. (Again, this is assuming that o3 is faster than humans in wall clock time.)
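To make the scoring argument concrete, here is a toy sketch. It assumes a simplified linear time decay with a floor; the decay rate, floor, point values, and solve times are all made-up illustrative numbers, not the actual CodeForces formula (which also has wrong-submission penalties and other details) and not real data about o3 or any contestant.

```python
def problem_score(max_points, minutes, decay_per_minute=1 / 250, floor=0.3):
    """Score for one solved problem under an assumed linear time decay.

    decay_per_minute and floor are illustrative assumptions.
    """
    decayed = max_points * (1 - decay_per_minute * minutes)
    return max(decayed, floor * max_points)


def contest_score(solves):
    """Total score over (max_points, minutes_at_solve) pairs."""
    return sum(problem_score(points, minutes) for points, minutes in solves)


# Hypothetical fast solver: three problems, all solved within 10 minutes.
fast = contest_score([(500, 2), (1000, 5), (1500, 10)])

# Hypothetical slower solver on the same three problems.
slow_same = contest_score([(500, 10), (1000, 30), (1500, 60)])

# The same slower solver, who also solves a harder fourth problem late.
slow_more = contest_score([(500, 10), (1000, 30), (1500, 60), (2000, 115)])

print(fast, slow_same, slow_more)
```

With these made-up numbers, the slower solver scores less than the fast one on the same three problems and only pulls ahead by solving a fourth. That is the sense in which equal contest performance implies the slower party solved more.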
I separately believe that humans would scale better than AIs w.r.t. t, but that is not the point I’m making here.