Only after seeing the headline success vs test-time-compute figure did I bother to check it against my best estimates of how this sort of thing should scale. If we assume:
A set of questions of increasing difficulty (in this case 100), such that:
The probability of getting question i correct on a given “run” is an s-curve like $\frac{1}{1+\exp(k(i-i_{1/2}))}$ for constants k and $i_{1/2}$
The model does N “runs”
If any are correct, the model finds the correct answer 100% of the time
N=1 gives a score of 20⁄100
Then, depending on k ($i_{1/2}$ is uniquely defined by k in this case), we get the following chance-of-success vs question-difficulty-rank curves:
Higher values of k make it look like a sharper “cutoff”: more questions are correct ~100% of the time, but more are also wrong ~100% of the time. Lower values of k make the curve less sharp, so the easier questions are wrong more often and the harder questions are right more often.
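As a quick concreteness check, here is a minimal Python sketch of this setup (the function names are my own): given k, it solves for the $i_{1/2}$ that makes a single run score 20⁄100, and evaluates the per-question success probabilities.

```python
import numpy as np

def p_correct(i, k, i_half):
    """Per-question success probability for a single run:
    a logistic s-curve in difficulty rank i."""
    z = np.clip(k * (i - i_half), -700.0, 700.0)  # clip to avoid overflow in exp
    return 1.0 / (1.0 + np.exp(z))

def solve_i_half(k, n_questions=100, target_score=20.0):
    """Find the i_half that makes the expected single-run score equal to
    target_score (20/100 here), by bisection on the monotone expected score."""
    i = np.arange(1, n_questions + 1)
    lo, hi = -1e4, 1e4
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if p_correct(i, k, mid).sum() < target_score:
            lo = mid  # score too low: shift the s-curve towards harder ranks
        else:
            hi = mid
    return 0.5 * (lo + hi)

for k in (0.05, 0.1, 0.5):
    print(f"k = {k}: i_1/2 ≈ {solve_i_half(k):.1f}")
```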
This gives the following best-of-N sampling curves, which are roughly linear in log N in the region between 20⁄100 and 80⁄100. The smaller the value of k, the steeper the curve.
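Continuing that sketch (and reusing p_correct and solve_i_half from it), the best-of-N score under the “any correct run is recognised” assumption is just $1-(1-p_i)^N$ summed over the questions, which is what produces the roughly-log-linear region:

```python
def best_of_n_score(N, p):
    """Expected score with N independent runs per question, assuming a
    correct answer is always recognised when at least one run produces it."""
    return (1.0 - (1.0 - p) ** N).sum()

i = np.arange(1, 101)
for k in (0.05, 0.1):
    p = p_correct(i, k, solve_i_half(k))
    print(f"k = {k}:", [round(best_of_n_score(N, p), 1) for N in (1, 4, 16, 64, 256)])
```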
Since the headline figure spans around two orders of magnitude of compute, the model appears to be performing on AIME similarly to best-of-N sampling in the 0.05 < k < 0.1 case.
If we allow the model to split the task up into S subtasks (assuming this creates no overhead and each subtask’s solution can be verified independently and accurately), then we get a steeper gradient, roughly proportional to S, and a small amount of curvature.
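Here is one guess at how to formalise that idealised no-overhead splitting (not necessarily the exact model intended), again reusing the helpers from the sketch above: each question is split into S equal subtasks, each subtask succeeds with probability $p_i^{1/S}$ per attempt and costs 1/S of a run, and perfect verification means a budget of N runs buys N independent attempts at every subtask. For N = 1 this reduces to the single-run score, and the slope in log N comes out roughly proportional to S.

```python
def split_score(N, S, p):
    """Expected score when each question is split into S equal subtasks.
    Assumed model: per-attempt subtask success probability p**(1/S), each
    attempt costs 1/S of a run, and subtask solutions are verified perfectly,
    so a budget of N runs gives N independent attempts at each subtask."""
    q = p ** (1.0 / S)                        # per-attempt success of one subtask
    return ((1.0 - (1.0 - q) ** N) ** S).sum()

p = p_correct(np.arange(1, 101), 0.05, solve_i_half(0.05))
for S in (1, 2, 4):
    print(f"S = {S}:", [round(split_score(N, S, p), 1) for N in (1, 4, 16, 64)])
```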
Of course this no-overhead assumption is unreasonable, since it requires correctly identifying the shortest path to success with independently verifiable subtasks. In reality, we would expect the model to spend extra compute on dead-end subtasks (for example, when doing a mathematical proof, verifying a correct statement which doesn’t actually get you closer to the answer, or, when coding, correctly writing a function which is not actually useful for the final product), so performance scaling from breaking up a task will almost certainly be a bit worse than this.
Whether or not the model is literally doing best-of-N sampling at inference time (probably it’s doing something at least a bit more complex), it seems to scale similarly to best-of-N under these conditions.