I think this is certainly not how it works because no amount of inference compute can make GPT-2 solve AIME.
That’s because GPT-2 isn’t CoT fine-tuned. Plenty of people are predicting it may be possible to get GPT-4-level performance out of a GPT-2-sized model with CoT. How confident are you that they’re wrong? (o1-mini is dramatically better than GPT-4 and likely 30B-70B parameters.)
The other part of “this is certainly not how it works” is that yes, in some cases you are going to be able to predict “results on this benchmark will go up 10% with such-and-such increase in compute” but there is no clear conversion between benchmarks and ability to take over the world/design nanotech/insert any other interesting capability.
I would be willing to bet a reasonable sum of money that “designing nanotech” is in the set of problems where it is possible to trade inference compute for training compute. It has the same shape as many problems in which inference scaling works (for example, solving math problems). If you have some design criteria for a world-destroying nanobot, and you get to choose between training a better nanobot-designing AI and running your nanobot AI for longer, you almost certainly want to do both. That is to say, finding a design for a world-destroying nanobot feels very much like a classic search problem where you have some acceptance criteria, a design space, and a model that gives you a prior over which parts of the space you should search first.
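To make the "search problem" framing concrete, here is a toy sketch in Python. It is a deliberately simplified model, not a claim about how any real system behaves: it assumes each sample from the model independently passes the acceptance criteria with some fixed probability p_single (a stand-in for how good the model's prior is), and that the acceptance check is perfect. The function names are made up for illustration.

```python
import math


def success_probability(p_single: float, num_samples: int) -> float:
    """Chance that at least one of num_samples independent draws is accepted.

    Assumption: draws are independent and the acceptance check never errs.
    """
    return 1.0 - (1.0 - p_single) ** num_samples


def samples_needed(p_single: float, target: float) -> float:
    """Smallest number of draws that reaches the target success rate.

    Returns math.inf when p_single is 0: no amount of sampling helps if the
    model puts no probability on any acceptable design.
    """
    if p_single <= 0.0:
        return math.inf
    if p_single >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_single))


# A better prior (more training) and more draws (more inference) are substitutes:
weak_model = samples_needed(0.01, 0.95)    # 299 draws
strong_model = samples_needed(0.10, 0.95)  # 29 draws
```

Under these assumptions both sides of the exchange show up in the same formula. If the model's per-sample success probability is effectively zero (the GPT-2-on-AIME case), the required sample count is infinite. If it is small but nonzero, a roughly 10x better prior buys a roughly 10x cut in inference compute, which is why you would want to do both.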