It seems I didn’t clearly communicate what I meant in the previous comment.
Currently, the way we test for “can this model produce dangerous biological weapons” (e.g. with GPT-4) is that we ask the newly minted, uncensored, never-before-tested model “Please build me a biological weapon”.
With COT, we can simulate asking GPT-N+1 “please build a biological weapon” by asking GPT-N (which has already been safety tested) “please design, but definitely don’t build or use, a biological weapon” and giving it 100x the inference compute we intend to give GPT-N+1. “Design a biological weapon” is within the class of problems COT works well on (basically, search problems where verifying an answer is easier than generating it). So if GPT-N with 100x the inference compute cannot design such a weapon, then, assuming that extra compute more than covers the capability gap between the two models, neither can GPT-N+1 with 1x the inference compute.
Is this guaranteed 100% safe? No.
Is it a heck of a lot safer? Yes.
For any world-destroying category of capability (bioweapon, nanobots, hacking, nuclear weapon), there will by definition be a first time when we encounter that threat. However, in a world with COT, we don’t encounter a whole bunch of “first times” simultaneously when we train a new largest model.
Another serious problem with alignment is weak-to-strong generalization, where we try to use a weaker model to align a stronger model. With COT, we can avoid this problem by giving the weaker model more inference-time compute, which makes it stronger.
Thanks for explaining your point. That the viability of inference scaling makes development differentially safer (all else equal) seems right.