no amount of inference compute can make GPT-2 solve AIME
That’s because GPT-2 isn’t COT fine-tuned. Plenty of people are predicting it may be possible to get GPT-4 level performance out of a GPT-2 sized model with COT. How confident are you that they’re wrong? (o1-mini is dramatically better than GPT-4 and likely has 30B-70B parameters.)
I think you can probably fit a lot inside a 1.5B model, but such a model would be very dissimilar to GPT-2: it would likely use much more training compute, and it would probably be the result of pruning (pruned networks can be small, but it’s notoriously difficult to train an equivalent network from scratch without pruning first). A minimal sketch of what I mean by pruning follows below.
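For concreteness, here is a minimal magnitude-pruning sketch in PyTorch (the layer size is arbitrary and illustrative, not from any real model). The key point is that the sparse network is obtained by starting from the trained dense one; training a network of that sparsity from scratch generally doesn’t match it:

```python
# Minimal magnitude-pruning sketch (illustrative layer size, not a real model).
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(1024, 1024)

# Zero out the 90% of weights with the smallest L1 magnitude.
prune.l1_unstructured(layer, name="weight", amount=0.9)

# Bake the mask in permanently (removes the reparametrization).
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.1%}")  # ~90%
```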
Also, I’m not sure that the training of o1 can be called “COT fine-tuning” without asterisks, because we don’t know how much compute actually went into it. It could easily be comparable to the compute needed to pretrain a model of the same size from scratch.
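To put rough numbers on that: under the standard C ≈ 6ND approximation for transformer training FLOPs (N parameters, D tokens), the fine-tuning compute rivals pretraining as soon as it processes a comparable number of tokens. The token counts below are illustrative assumptions, not anything OpenAI has disclosed:

```python
# Back-of-envelope using the standard C ≈ 6 * N * D training-FLOPs approximation.
# Both token counts are illustrative assumptions; OpenAI has not disclosed them.
N = 1.5e9            # parameters (GPT-2 size)
D_pretrain = 10e9    # assumed pretraining tokens
D_finetune = 10e9    # assumed tokens consumed during RL/COT fine-tuning

pretrain_flops = 6 * N * D_pretrain
finetune_flops = 6 * N * D_finetune  # ignores sampling overhead, which makes RL costlier

print(f"pretrain:  {pretrain_flops:.2e} FLOPs")
print(f"fine-tune: {finetune_flops:.2e} FLOPs")
print(f"ratio:     {finetune_flops / pretrain_flops:.1f}x")
```

If anything, this understates the RL side, since generating samples costs extra inference FLOPs on top of the training updates.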
I haven’t seen a direct comparison between o1 and GPT-4. OpenAI has only published comparisons against GPT-4o, which itself appears to be a distilled mini-model. The comparison is also muddied by the fact that o1 seems to be deliberately trained on coding/math tasks, unlike GPT-4o.
(I think that making predictions about the future based on what OpenAI says publicly about their models should generally be treated as naive, because we are getting an intentionally obfuscated picture from them.)
What I am saying is that if you take the original GPT-2, COT-prompt it, and fine-tune it on its own outputs with some sort of RL, using less than 50% of the compute used to train GPT-2 itself, you are unlikely (<5%) to get GPT-4 level performance (because otherwise somebody would have already done it). A sketch of the setup I have in mind follows below.
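To be concrete about the setup I mean, here is a minimal REINFORCE-style sketch: sample a chain of thought from GPT-2 and reward it if the final answer is correct. The task, reward, and hyperparameters are illustrative assumptions, not anything OpenAI has described; a real pipeline would need a reward baseline, a KL penalty against the base model, and vastly more data:

```python
# Minimal REINFORCE sketch for "COT prompt + RL on outputs" with GPT-2.
# Everything task-specific here is an illustrative assumption.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

# Toy task and reward, not a real benchmark.
prompt = "Q: What is 17 * 3? Let's think step by step.\nA:"
answer = "51"

inputs = tokenizer(prompt, return_tensors="pt")
prompt_len = inputs.input_ids.shape[1]

# Sample a chain of thought from the current policy.
sample = model.generate(
    **inputs,
    do_sample=True,
    max_new_tokens=64,
    pad_token_id=tokenizer.eos_token_id,
)
completion = tokenizer.decode(sample[0, prompt_len:])
reward = 1.0 if answer in completion else 0.0  # 1 iff the answer appears

# REINFORCE: push up the log-likelihood of sampled tokens, scaled by reward.
logits = model(sample).logits[:, :-1]                       # predicts token t+1 from prefix
logprobs = torch.log_softmax(logits, dim=-1)
token_logprobs = logprobs.gather(-1, sample[:, 1:].unsqueeze(-1)).squeeze(-1)
loss = -reward * token_logprobs[:, prompt_len - 1 :].sum()  # completion tokens only

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

My claim is that loops like this one, run within that compute budget on the original GPT-2, don’t get you to GPT-4 level performance.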
This is an empirical question, so we’ll find out sooner or later. I’m not particularly concerned that “OpenAI is lying”, since COT scaling has been independently reproduced and matches what we see in other domains.