I understand; what I don’t understand is how you are going to answer this question. It’s surely ill-advised to throw 100x compute at model X to see if it takes over the world.
How do you think people do anything dangerous ever? How do you think nuclear bombs or biological weapons or tall buildings are built? You write down a design, you test it in simulation and then you look at the results. It may be rocket science, but it’s not a novel problem unique to AI.
Tall buildings are very predictable, and you can easily iterate on your experience before anything can really go wrong. Nuclear bombs are similar (you can, in principle, test in a remote enough location).
Biological weapons seem inherently more dangerous (though still more predictable overall than AI), and I’d naively imagine it to be simply very risky to develop extremely potent biological weapons.
It seems I didn’t clearly communicate what I meant in the previous comment.
Currently, the way we test for “can this model produce dangerous biological weapons” (e.g. in GPT-4) is to ask the newly-minted, uncensored, never-before-tested model “Please build me a biological weapon”.
With COT, we can simulate asking GPT-N+1 “please build a biological weapon” by asking GPT-N (which has already been safety tested) “please design, but definitely don’t build or use, a biological weapon” and giving it 100x the inference compute we intend to give GPT-N+1. Since “design a biological weapon” is within the class of problems COT works well on (basically, search problems where verifying the answer is easier than generating it), if GPT-N (with 100x the inference compute) cannot design such a weapon, neither can GPT-N+1 (with 1x the inference compute).
Is this guaranteed 100% safe? no.
Is it a heck-of-a-lot safer? yes.
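To make the procedure concrete, here is a minimal sketch in Python of the kind of forecasting harness I have in mind. All names (`EvalTask`, `best_of_n`, `forecast_next_gen`) and the specific numbers are illustrative assumptions, not an existing eval suite; extra inference compute is modeled crudely as best-of-n sampling against a verifier.

```python
# Minimal sketch of the forecasting idea (names and numbers are illustrative,
# not any lab's actual eval harness). The claim being tested: if GPT-N with a
# 100x inference budget cannot pass a dangerous-capability eval, then GPT-N+1
# at a 1x budget is unlikely to pass it either.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalTask:
    prompt: str                    # e.g. a "design only" red-team query
    passes: Callable[[str], bool]  # verifier: checking is easier than generating


def best_of_n(model: Callable[[str], str], prompt: str, n: int,
              verifier: Callable[[str], bool]) -> bool:
    """Spend extra inference compute as repeated chain-of-thought sampling;
    succeed if any of the n attempts passes the verifier."""
    return any(verifier(model(prompt)) for _ in range(n))


def forecast_next_gen(current_model: Callable[[str], str],
                      tasks: List[EvalTask],
                      compute_multiplier: int = 100) -> bool:
    """Run the dangerous-capability evals on the already-safety-tested GPT-N,
    but with compute_multiplier times the inference budget planned for GPT-N+1.
    Returns True if any task is solved, i.e. the forecast says 'not safe'."""
    return any(
        best_of_n(current_model, t.prompt, compute_multiplier, t.passes)
        for t in tasks
    )
```

If `forecast_next_gen` returns False for GPT-N at a 100x budget, that is (fallible) evidence that GPT-N+1 at a 1x budget won’t clear the same bar, which is exactly the “safer, not guaranteed safe” claim above.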
For any world-destroying category of capability (bioweapon, nanobots, hacking, nuclear weapon), there will by definition be a first time when we encounter that threat. However, in a world with COT, we don’t encounter a whole bunch of “first times” simultaneously when we train a new largest model.
Another serious problem in alignment is weak-to-strong generalization, where we try to use a weaker model to align a stronger model. With COT, we can avoid this problem by making the weaker model stronger: just give it more inference-time compute.
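Here is a similarly hedged sketch of that idea: the weak but trusted model gets a large per-judgment inference budget (modeled here as majority voting over many samples) before its verdicts are used to supervise the stronger model. Function names are made up for illustration.

```python
# Minimal sketch of the weak-to-strong idea under the same assumption: a weak,
# trusted model can be made a stronger supervisor by spending more inference
# compute per judgment (here, majority vote over many sampled verdicts).

from collections import Counter
from typing import Callable, List, Tuple


def boosted_judgment(weak_judge: Callable[[str], str], transcript: str,
                     n_samples: int = 100) -> str:
    """Query the weak-but-trusted judge n_samples times and take the majority
    verdict, trading inference compute for supervision quality."""
    verdicts = [weak_judge(transcript) for _ in range(n_samples)]
    return Counter(verdicts).most_common(1)[0][0]


def supervise_strong_model(weak_judge: Callable[[str], str],
                           strong_outputs: List[str]) -> List[Tuple[str, str]]:
    """Label the strong model's outputs with compute-boosted judgments,
    yielding (output, verdict) pairs to filter or fine-tune on."""
    return [(out, boosted_judgment(weak_judge, out)) for out in strong_outputs]
```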
Thanks for explaining your point. The claim that the viability of inference scaling makes development differentially safer (all else equal) seems right.