Only after seeing the headline success vs test-time-compute figure did I bother to check it against my best estimates of how this sort of thing should scale. If we assume:
A set of questions of increasing difficulty (in this case 100), such that:
The probability of getting question i correct on a given “run” is an s-curve like $\frac{1}{1+\exp(k(i-i_{1/2}))}$ for constants k and $i_{1/2}$
The model does N “runs”
If any are correct, the model finds the correct answer 100% of the time
N=1 gives a score of 20⁄100
Then, depending on k ($i_{1/2}$ is uniquely defined by k in this case), we get the following chance-of-success vs question-difficulty-rank curves:
Higher values of k make it look like a sharper “cutoff”: more questions are correct ~100% of the time, but more are also wrong ~100% of the time. Lower values of k make the curve less sharp, so the easier questions are wrong more often and the harder questions are right more often.
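As a quick concreteness check, here is a minimal Python sketch of this setup (the function names are my own): given k, it solves for the $i_{1/2}$ that makes a single run score 20⁄100, and evaluates the per-question success probabilities.

```python
import numpy as np

def p_correct(i, k, i_half):
    """Per-question success probability for a single run:
    a logistic s-curve in difficulty rank i."""
    z = np.clip(k * (i - i_half), -700.0, 700.0)  # clip to avoid overflow in exp
    return 1.0 / (1.0 + np.exp(z))

def solve_i_half(k, n_questions=100, target_score=20.0):
    """Find the i_half that makes the expected single-run score equal to
    target_score (20/100 here), by bisection on the monotone expected score."""
    i = np.arange(1, n_questions + 1)
    lo, hi = -1e4, 1e4
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if p_correct(i, k, mid).sum() < target_score:
            lo = mid  # score too low: shift the s-curve towards harder ranks
        else:
            hi = mid
    return 0.5 * (lo + hi)

for k in (0.05, 0.1, 0.5):
    print(f"k = {k}: i_1/2 ≈ {solve_i_half(k):.1f}")
```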
This gives the following best-of-N sampling curves, which are roughly linear in log N in the region between 20⁄100 and 80⁄100. The smaller the value of k, the steeper the curve.
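Continuing that sketch (and reusing p_correct and solve_i_half from it), the best-of-N score under the “any correct run is recognised” assumption is just $1-(1-p_i)^N$ summed over the questions, which is what produces the roughly-log-linear region:

```python
def best_of_n_score(N, p):
    """Expected score with N independent runs per question, assuming a
    correct answer is always recognised when at least one run produces it."""
    return (1.0 - (1.0 - p) ** N).sum()

i = np.arange(1, 101)
for k in (0.05, 0.1):
    p = p_correct(i, k, solve_i_half(k))
    print(f"k = {k}:", [round(best_of_n_score(N, p), 1) for N in (1, 4, 16, 64, 256)])
```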
Since the headline figure spans around two orders of magnitude of compute, the model appears to be performing on AIME similarly to best-of-N sampling in the 0.05 < k < 0.1 case.
If we allow the model to split the task up into S subtasks (assuming this creates no overhead and each subtask’s solution can be verified independently and accurately), then we get a steeper gradient, roughly proportional to S, and a small amount of curvature.
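Here is one guess at how to formalise that idealised no-overhead splitting (not necessarily the exact model intended), again reusing the helpers from the sketch above: each question is split into S equal subtasks, each subtask succeeds with probability $p_i^{1/S}$ per attempt and costs 1/S of a run, and perfect verification means a budget of N runs buys N independent attempts at every subtask. For N = 1 this reduces to the single-run score, and the slope in log N comes out roughly proportional to S.

```python
def split_score(N, S, p):
    """Expected score when each question is split into S equal subtasks.
    Assumed model: per-attempt subtask success probability p**(1/S), each
    attempt costs 1/S of a run, and subtask solutions are verified perfectly,
    so a budget of N runs gives N independent attempts at each subtask."""
    q = p ** (1.0 / S)                        # per-attempt success of one subtask
    return ((1.0 - (1.0 - q) ** N) ** S).sum()

p = p_correct(np.arange(1, 101), 0.05, solve_i_half(0.05))
for S in (1, 2, 4):
    print(f"S = {S}:", [round(split_score(N, S, p), 1) for N in (1, 4, 16, 64)])
```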
Of course this no-overhead assumption is unreasonable, since it requires correctly identifying the shortest path to success with independently verifiable subtasks. In reality, we would expect the model to spend extra compute on dead-end subtasks (for example, when doing a mathematical proof, verifying a correct statement which doesn’t actually get you closer to the answer, or, when coding, correctly writing a function which is not actually useful for the final product), so performance scaling from breaking up a task will almost certainly be a bit worse than this.
Whether or not the model is literally doing best-of-N sampling at inference time (probably it’s doing something at least a bit more complex), it seems to scale similarly to best-of-N under these conditions.