in domains where there is a way to verify that the solution actually works, RL can scale to superhuman performance
Sure, the theory on that is solid. But how efficiently does it scale off-distribution, in practice?
The inference-time scaling laws, much like the pretraining scaling laws, are ultimately based on test sets whose entries are “shallow” (in the previously discussed sense). They don’t tell us much about how well the technique scales with the “conceptual depth” of a problem.
o3 took a million dollars in inference-time compute, plus unknown amounts of training-time compute, just to solve the “easy” part of the FrontierMath benchmark (problems that likely take human experts single-digit hours, maybe <1 hour for particularly skilled ones). How much would be needed to beat the “hard” subset of FrontierMath? How much more for problems that take individual researchers days; or problems that take entire math departments months; or problems that take entire fields decades?
It’s possible that the “synthetic data flywheel” works so well that the amount of human-researcher-hour-equivalents per unit of compute scales, say, exponentially with some aspect of o-series’ training, and so o6 in 2027 solves the Riemann Hypothesis.
Or it scales poorly, and o6 can barely clear real-life equivalents of the hard FrontierMath problems. Perhaps the training costs (generating all the CoT trees on which the RL training is then done) scale exponentially, while researcher-hour-equivalents per unit of compute scale only linearly.
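Purely as a toy illustration of the gap between those two regimes (every constant and functional form below is an invented assumption, not an estimate), here is how the compute cost of solving a problem might grow with its “conceptual depth”, measured in human researcher-hours:

```python
# Toy illustration, not a forecast: how the compute needed to solve a problem
# might grow with its "conceptual depth" (in human researcher-hours) under the
# two regimes sketched above. All constants and functional forms are arbitrary
# assumptions, chosen only to make the qualitative contrast visible.

def cost_if_flywheel_works(depth_hours: float) -> float:
    # Optimistic regime: deeper problems cost only polynomially more compute.
    return 1e3 * depth_hours ** 1.5

def cost_if_flywheel_stalls(depth_hours: float) -> float:
    # Pessimistic regime: generating the CoT training trees for deeper
    # problems blows up exponentially with depth.
    return 1e3 * 2.0 ** (depth_hours / 10.0)

# Depths loosely corresponding to: a hard benchmark problem, a research week,
# a months-long departmental project, years of a field's effort.
for depth_hours in (5, 40, 1_000, 10_000):
    print(f"{depth_hours:>6} researcher-hours:  "
          f"optimistic cost ~{cost_if_flywheel_works(depth_hours):.1e}  "
          f"pessimistic cost ~{cost_if_flywheel_stalls(depth_hours):.1e}")
```

Under the first curve, field-scale problems are only a few orders of magnitude costlier than benchmark problems; under the second, they are out of reach at any plausible budget.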
It doesn’t seem to me that we know which one it is yet. Do we?
I don’t think we know yet whether this approach will succeed in practice, or whether its training costs will make it infeasible.