Re the o1 AIME accuracy at test-time scaling graphs: I think it’s crucial to understand that the test-time compute x-axis is likely wildly different from the train-time compute x-axis. You can throw 10s–100s of millions of dollars at train-time compute once and still run a company. You can’t spend that again on test-time compute for every single call. The per-call scale at which test-time compute happens, and can happen while staying anywhere near commercial viability, needs to be perhaps eight OOMs below train-time compute. Calling anything happening there a “scaling law” is a stretch of the term (very helpful for fundraising) and at best valid very locally.
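As a rough back-of-envelope for where a figure like eight OOMs could come from (my own assumed round numbers, not figures reported by any lab): a one-off training budget on the order of $10^8 dollars versus a commercially viable per-call spend on the order of a dollar gives

$$\log_{10}\!\left(\frac{\sim\$10^{8}\ \text{(assumed one-off train-time spend)}}{\sim\$1\ \text{(assumed viable per-call test-time spend)}}\right) \approx 8\ \text{OOMs.}$$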
If RL is actually happening at a compute scale beyond 10s of millions of dollars, and this gives much better final results than doing the same at a smaller scale, that would change my mind. Until then, I think scaling in any meaningful sense of the word is not what drives capabilities forward at the moment; algorithmic improvement is. And that improvement is not just coming from the currently leading labs (as can be seen e.g. here and here).
I agree that o1 doesn’t have a test-time scaling law, at least not in a strong sense, while generatively pretrained transformers seem to have a scaling law in an extremely strong sense.
I’d put my position like this: if you trained a GPT on a human-generated internet a million times larger than the internet of our world, with a million times more parameters, for a million times more iterations, then I am confident that that GPT could beat the Minecraft Ender Dragon zero-shot.
If you gave o1 a quadrillion times more thinking time, there is no way in hell it would beat the Ender Dragon.