@Scott Alexander, correction to the above: there are rumors that, like o1, o3 doesn’t generate runtime trees of thought either, and that they spent thousands-of-dollars’ worth of compute on single tasks by (1) having it generate a thousand separate CoTs, and (2) outputting the answer the model produced most frequently. I.e., the “pruning meta-heuristic” I speculated about might just be a (manually-implemented) majority vote.
I think the guy in the quotes might be misinterpreting OpenAI researchers’ statements, but it’s possible.
In which case:
We have to slightly reinterpret the reason for having the model try a thousand times. Rather than outputting the correct answer if at least one try is correct, it outputs the correct answer if, in N tries, it produces the correct answer more frequently than any incorrect one. The fact that they had to set N = 1024 for best performance on ARC-AGI still suggests there’s a large amount of brute-forcing involved.
That's because it implies that at, say, N = 100, the correct answer doesn't yet reliably come out more frequent than the incorrect ones. So on the problems which o3 got wrong in the N = 6 regime but got right in the N = 1024 regime, the probability of any given CoT producing the correct answer is quite low.
This has similar implications for the FrontierMath performance, if the interpretation of the dark-blue vs. light-blue bars is that dark-blue is for N = 1 or N = 6, and light-blue is for N = bignumber.
We have to throw out everything about the “pruning” meta-heuristics; only the “steering” meta-heuristics exist. In this case, the transfer-of-performance problem would be that the “steering” heuristics only become better for math/programming; that is, RL only skews the distribution over CoTs towards the high-quality ones for problems in those domains. (The metaphorical “taste” then still exists, but only within CoTs.)
(I now somewhat regret introducing the “steering vs. pruning meta-heuristic” terminology.)
Again, I think this isn’t really confirmed, but I can very much see it.
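To make the two readings of "try a thousand times" concrete, here is a toy simulation of "correct if at least one try is correct" versus "correct if the most frequent answer is correct". It's only a sketch under made-up assumptions: each CoT is modeled as an independent draw that is correct with probability `p_correct` and otherwise lands uniformly on one of `n_wrong` distinct wrong answers. The function names and all the numbers are hypothetical and say nothing about how o3 actually behaves or how OpenAI aggregates samples.

```python
import random
from collections import Counter


def sample_answer(p_correct, n_wrong, rng):
    """Hypothetical stand-in for one CoT: 'correct' with probability
    p_correct, otherwise one of n_wrong wrong answers chosen uniformly."""
    if rng.random() < p_correct:
        return "correct"
    return f"wrong_{rng.randrange(n_wrong)}"


def majority_vote(answers):
    """Return the most frequent answer (a plurality vote)."""
    return Counter(answers).most_common(1)[0][0]


def estimate(p_correct, n_wrong, n_samples, trials=500, seed=0):
    """Estimate, over `trials` simulated problems, how often
    (a) the plurality answer is correct, and
    (b) at least one of the n_samples answers is correct."""
    rng = random.Random(seed)
    vote_hits = 0
    any_hits = 0
    for _ in range(trials):
        answers = [sample_answer(p_correct, n_wrong, rng)
                   for _ in range(n_samples)]
        vote_hits += majority_vote(answers) == "correct"
        any_hits += "correct" in answers
    return vote_hits / trials, any_hits / trials


if __name__ == "__main__":
    # Assumed toy setting: a single CoT is right only 15% of the time,
    # with the remaining mass spread over 20 different wrong answers.
    for n in (1, 6, 100, 1024):
        vote_acc, any_acc = estimate(0.15, 20, n)
        print(f"N={n:5d}  majority vote: {vote_acc:.2f}  "
              f"at least one correct: {any_acc:.2f}")
```

In this toy setting the vote only becomes reliable once N is large enough for the correct answer's small per-sample edge to dominate the noise, which is the sense in which needing a big N points to a low per-CoT success rate. Note that the vote can never beat the "at least one correct" criterion on the same samples, since a correct plurality answer has to appear at least once.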