I think this is certainly not how it works because no amount of inference compute can make GPT-2 solve AIME.
That’s because GPT-2 isn’t COT fine-tuned. Plenty of people are predicting it may be possible to get GPT-4 level performance out of a GPT-2 sized model with COT. How confident are you that they’re wrong? (o1-mini is dramatically better than GPT-4 and likely 30b-70b parameters)
I think that you can probably put a lot inside a 1.5B model, but I just think that such a model is going to be very dissimilar to GPT-2 and will likely utilize much more training compute and will probably be the result of pruning (pruned networks can be small, but it’s notoriously difficult to train equivalent networks without pruning).
Also, I’m not sure that the training of o1 can be called “COT fine-tuning” without asterisks, because we don’t know how much compute actually went into this training. It could easily be comparable to the compute necessary to train a model of the same size.
I haven’t seen a direct comparison between o1 and GPT-4. OpenAI only told us about GPT-4o, which itself seems to be a distilled mini-model. The comparison can also be unclear because o1 seems to be deliberately trained on coding/math tasks, unlike GPT-4o.
(I think that “making predictions about the future based on what OpenAI says about their models in public” should generally be treated as naive, because we are getting an intentionally obfuscated picture from them.)
What I am saying is that if you take the original GPT-2, COT prompt it, and fine-tune on its outputs using some sort of RL, using less than 50% of the compute used to train GPT-2, you are unlikely (<5%) to get GPT-4 level performance (because otherwise somebody would have already done that).
This is an empirical question, so we’ll find out sooner-or-later. I’m not particularly concerned that “OpenAI is lying”, since COT scaling has been independently reproduced and matches what we see in other domains.
The other part of “this is certainly not how it works” is that yes, in some cases you are going to be able to predict “results on this benchmark will go up 10% with such-and-such increase in compute”, but there is no clear conversion between benchmarks and the ability to take over the world/design nanotech/insert any other interesting capability.
I would be willing to bet a reasonable sum of money that “designing nanotech” is in the set of problems where it is possible to trade inference compute for training compute. It has the same shape as many problems in which inference scaling works (for example, solving math problems). If you have some design criteria for a world-destroying nanobot, and you get to choose between training a better nanobot-designing AI and running your nanobot-designing AI for longer, you almost certainly want to do both. That is to say, finding a design for a world-destroying nanobot feels very much like a classic search problem where you have some acceptance criteria, a design space, and a model that gives you a prior over which parts of the space you should search first.
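To make the "classic search problem" shape concrete, here is a minimal toy sketch in Python. Everything in it is a hypothetical stand-in: `propose` plays the role of the model's prior, `accepts` the acceptance criteria, and `prior_quality` is an invented knob for how much a better-trained model helps. It is not a claim about how any real system searches.

```python
import random

def propose(rng, prior_quality, space_size, target):
    """Hypothetical 'model': with probability prior_quality it proposes the
    right design directly; otherwise it guesses uniformly over the space."""
    if rng.random() < prior_quality:
        return target
    return rng.randrange(space_size)

def accepts(candidate, target):
    """Hypothetical acceptance criteria: a cheap check of one candidate."""
    return candidate == target

def search(prior_quality, budget, space_size=10_000, target=1234, seed=0):
    """Sample candidates until one is accepted or the budget runs out.

    Returns the number of attempts used, or None if the budget is exhausted.
    A better prior (more training) and a larger budget (more inference)
    both raise the chance of success.
    """
    rng = random.Random(seed)
    for attempt in range(1, budget + 1):
        if accepts(propose(rng, prior_quality, space_size, target), target):
            return attempt
    return None
```

In this toy, raising `prior_quality` corresponds to training a better model and raising `budget` corresponds to running the same model for longer; either one helps, which is why you would want both.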
I mean, yes, likely? But it doesn’t make it easy to evaluate whether a model is going to have world-ending capabilities without getting the world ended.
Suppose you want to know “will my GPT-9 model be able to produce world-destroying nanobots (given X inference compute)?” You can instead ask “will my GPT-8 model be able to produce world-destroying nanobots (given X*100 inference compute)?”
This doesn’t eliminate all risk, but it means training is no longer the step that generates the risky capability. In particular, GPT models are generally trained in an “unsafe” state and then RLHF’d into a “safe” state. So instead of having to deal with a model that is both not yet helpful/harmless and able to create world-destroying nanobots at the same time (the world prior to COT), you get to deal with these problems individually (in a world with COT).
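The GPT-8 versus GPT-9 trade can be illustrated with a back-of-the-envelope calculation. The sketch below assumes (and this is a strong simplification, not something established in this thread) that extra inference compute behaves like independent attempts, each succeeding with a fixed probability `p_single`. Both function names are made up for illustration.

```python
import math

def success_probability(p_single, n_samples):
    """Chance that at least one of n independent attempts succeeds."""
    return 1.0 - (1.0 - p_single) ** n_samples

def samples_needed(p_single, target_probability):
    """Smallest number of independent attempts that reaches the target
    overall success probability, given per-attempt success p_single."""
    if not 0.0 < p_single < 1.0:
        raise ValueError("p_single must be strictly between 0 and 1")
    if not 0.0 < target_probability < 1.0:
        raise ValueError("target_probability must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target_probability) / math.log(1.0 - p_single))
```

Under these assumptions, `success_probability(0.001, 100)` comes out at roughly 0.095, close to the 0.1 of a model that is 100x more likely to succeed on a single attempt. Real chain-of-thought compute is not a set of independent samples, so this only shows the direction of the trade, not its exchange rate.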
I understand; what I don’t understand is how you are going to answer this question. It’s surely ill-advised to throw X*100 compute at a model to see if it takes over the world.
How do you think people do anything dangerous ever? How do you think nuclear bombs or biological weapons or tall buildings are built? You write down a design, you test it in simulation and then you look at the results. It may be rocket science, but it’s not a novel problem unique to AI.
Tall buildings are very predictable, and you can easily iterate on your experience before anything can really go wrong. Nuclear bombs are similar (you can in principle test in a remote enough location).
Biological weapons seem inherently more dangerous (still overall more predictable than AI), and I’d naively imagine it to be simply very risky to develop extremely potent biological weapons.
It seems I didn’t clearly communicate what I meant in the previous comment.
Currently the way we test for “can this model produce dangerous biological weapons” (e.g. in GPT-4) is that we ask the newly-minted, uncensored, never-before-tested model “Please build me a biological weapon”.
With COT, we can simulate asking GPT-N+1 “please build a biological weapon” by asking GPT-N (which has already been safety tested) “please design, but definitely don’t build or use a biological weapon” and give it 100x the inference compute we intend to give GPT-N+1. Since “design a biological weapon” is within the class of problems COT works well on (basically, search problems where you can verify the answer more easily than generating it), if GPT-N (with 100x the inference compute) cannot build such a weapon, neither can GPT-N+1 (with 1x the inference compute).
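As a harmless illustration of "search problems where you can verify the answer more easily than generating it", consider subset sum, a standard textbook example that I am substituting here for the real problem. In the Python sketch below (function names are my own), checking a proposed answer takes one pass, while finding one by brute force may try up to 2**n subsets.

```python
from itertools import combinations

def verify(numbers, subset, target):
    """Cheap check: every element of subset comes from numbers
    (respecting multiplicity) and the subset sums to target."""
    pool = list(numbers)
    for x in subset:
        if x not in pool:
            return False
        pool.remove(x)
    return sum(subset) == target

def generate(numbers, target):
    """Expensive search: brute force over subsets, smallest first.
    Returns the first subset found that sums to target, or None."""
    for size in range(len(numbers) + 1):
        for subset in combinations(numbers, size):
            if sum(subset) == target:
                return subset
    return None
```

A model's prior would stand in for a smarter ordering of the candidates in `generate`; the cheap `verify` step is what lets extra search effort be spent without needing to trust any individual proposal.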
Is this guaranteed 100% safe? No.
Is it a heck-of-a-lot safer? Yes.
For any world-destroying category of capability (bioweapon, nanobots, hacking, nuclear weapon), there will by definition be a first time when we encounter that threat. However, in a world with COT, we don’t encounter a whole bunch of “first times” simultaneously when we train a new largest model.
Another serious problem with alignment is weak-to-strong generalization, where we try to use a weaker model to align a stronger model. With COT, we can avoid this problem by making the weaker model stronger by giving it more inference-time compute.
Thanks for explaining your point. That the viability of inference scaling makes development differentially safer (all else equal) seems right.