but there is no clear conversion between benchmarks and ability to take over the world/design nanotech/insert any other interesting capability.
I would be willing to bet a reasonable sum of money that “designing nanotech” is in the set of problems where it is possible to trade inference compute for training compute. It has the same shape as many problems on which inference scaling works (for example, solving math problems). If you have some design criteria for a world-destroying nanobot, and you get to choose between training a better nanobot-designing AI and running your existing nanobot-designing AI for longer, you almost certainly want to do both. That is to say, finding a design for a world-destroying nanobot feels very much like a classic search problem: you have some acceptance criteria, a design space, and a model that gives you a prior over which parts of the space you should search first.
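To make the search framing concrete, here is a toy sketch of generate-and-verify search. Everything in it (the function names, the Gaussian "design space", the numbers) is an illustrative stand-in rather than anything from the thread; the point is only that a better-trained proposer and a larger sampling budget are interchangeable ways to clear the same acceptance criterion.

```python
# Toy sketch of the "search with a verifier" framing: the model supplies a
# prior over the design space, the acceptance criterion supplies the
# verifier, and inference compute buys more samples.

import random

def propose_design(model_quality: float) -> float:
    # A better-trained model concentrates its proposals nearer the target.
    return random.gauss(mu=model_quality, sigma=1.0)

def meets_criteria(design: float, threshold: float = 3.0) -> bool:
    # The acceptance check, assumed to be much cheaper than proposing.
    return design >= threshold

def search(model_quality: float, inference_budget: int) -> bool:
    # Trading training compute (model_quality) against inference compute
    # (inference_budget): more samples or better samples can both work.
    return any(
        meets_criteria(propose_design(model_quality))
        for _ in range(inference_budget)
    )

# A weaker model with a large budget can match a stronger model with a
# small one, which is the trade the comment is pointing at.
print(search(model_quality=1.0, inference_budget=10_000))
print(search(model_quality=2.5, inference_budget=100))
```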
I mean, yes, likely? But it doesn’t make it easy to evaluate whether a model is going to have world-ending capabilities without getting the world ended.
Suppose you want to know “will my GPT-9 model be able to produce world-destroying nanobots (given X inference compute)?” You can instead ask “will my GPT-8 model be able to produce world-destroying nanobots (given X*100 inference compute)?”
This doesn’t eliminate all risk, but it means training is no longer the step that generates risky capabilities. In particular, GPT models are generally trained in an “unsafe” state and then RLHF’d into a “safe” state. So instead of having to deal with a model that is simultaneously not yet helpful/harmless and able to create world-destroying nanobots (the world prior to COT), you get to deal with these problems individually (in a world with COT).
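One way to write down the assumption behind this substitution (my notation; the 100x exchange rate is treated as a given assumption, not a measured constant):

$$p_{\text{GPT-9}}(\text{solves the task} \mid X \text{ inference compute}) \;\le\; p_{\text{GPT-8}}(\text{solves the task} \mid 100X \text{ inference compute}),$$

so a negative result for GPT-8 at a 100X budget upper-bounds what GPT-9 can do at an X budget, provided the exchange-rate assumption actually holds for the task in question.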
I understand; what I don’t understand is how you are going to answer this question. It’s surely ill-advised to throw X*100 compute at a model to see if it takes over the world.
How do you think people do anything dangerous ever? How do you think nuclear bombs or biological weapons or tall buildings are built? You write down a design, test it in simulation, and then look at the results. It may be rocket science, but it’s not a novel problem unique to AI.
Tall buildings are very predictable, and you can easily iterate on your experience before anything can really go wrong. Nuclear bombs are similar (you can, in principle, test in a remote enough location).
Biological weapons seem inherently more dangerous (though still more predictable overall than AI), and I’d naively imagine it to be simply very risky to develop extremely potent biological weapons.
It seems I didn’t clearly communicate what I meant in the previous comment.
Currently, the way we test for “can this model produce dangerous biological weapons” (e.g. in GPT-4) is that we ask the newly minted, uncensored, never-before-tested model “Please build me a biological weapon”.
With COT, we can simulate asking GPT-N+1 “please build a biological weapon” by asking GPT-N (which has already been safety tested) “please design, but definitely don’t build or use, a biological weapon” and giving it 100x the inference compute we intend to give GPT-N+1. Since “design a biological weapon” is within the class of problems COT works well on (basically, search problems where you can verify the answer more easily than you can generate it), if GPT-N (with 100x the inference compute) cannot design such a weapon, then neither can GPT-N+1 (with 1x the inference compute).
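A minimal sketch of how such a probe could be wired up, under the assumptions above. The model interface, verifier, and exchange rate are hypothetical placeholders, and in practice the prompt would target a benign proxy task rather than an actual weapon design.

```python
# Sketch of the "probe the previous, already-safety-tested model at a
# boosted inference budget" protocol described above.

from typing import Callable, Protocol

class Model(Protocol):
    def generate(self, prompt: str, inference_budget: int) -> str: ...

def forecast_next_gen_capability(
    safety_tested_model: Model,          # GPT-N, already RLHF'd and red-teamed
    proxy_task_prompt: str,
    verify: Callable[[str], bool],       # assumed much cheaper than generation
    planned_budget_for_next_gen: int,
    exchange_rate: int = 100,            # assumed inference-for-training trade
) -> bool:
    """Run GPT-N with `exchange_rate` times the inference budget planned
    for GPT-N+1. A False result is evidence, under the exchange-rate and
    cheap-verification assumptions, that GPT-N+1 will also fail at 1x."""
    boosted_budget = exchange_rate * planned_budget_for_next_gen
    answer = safety_tested_model.generate(proxy_task_prompt, boosted_budget)
    return verify(answer)
```

A negative result from this kind of probe is only as strong as the exchange-rate and verification assumptions, which is why the next lines hedge it as safer rather than safe.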
Is this guaranteed 100% safe? No.
Is it a heck of a lot safer? Yes.
For any world-destroying category of capability (bioweapons, nanobots, hacking, nuclear weapons), there will by definition be a first time we encounter that threat. However, in a world with COT, we don’t encounter a whole bunch of “first times” simultaneously whenever we train a new largest model.
Another serious problem with alignment is weak-to-strong generalization, where we try to use a weaker model to align a stronger model. With COT, we can avoid this problem by giving the weaker model more inference-time compute, effectively making it stronger.
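A toy sketch of that idea: a weaker, trusted judge is given extra inference compute (here, repeated judgments plus majority voting) so that it can more reliably check a stronger model’s outputs. The noisy-judge model, the accuracy number, and the independence of the judge’s errors are all illustrative assumptions, not claims about how weak-to-strong supervision actually behaves.

```python
# Boosting a weak judge with inference compute instead of training compute.

import random

def weak_judge(output_is_bad: bool, accuracy: float = 0.7) -> bool:
    # A single cheap judgment: correct with probability `accuracy`.
    return output_is_bad if random.random() < accuracy else not output_is_bad

def boosted_judge(output_is_bad: bool, votes: int = 101) -> bool:
    # Spend more inference compute on the *weak* judge by majority-voting
    # many judgments, rather than training a stronger judge.
    flags = sum(weak_judge(output_is_bad) for _ in range(votes))
    return flags > votes // 2

# With enough votes, the weak judge's verdict becomes far more reliable
# (in this toy, because its errors are assumed independent).
trials = 1000
correct = sum(boosted_judge(output_is_bad=True) for _ in range(trials))
print(f"boosted judge flags the bad output in {correct}/{trials} trials")
```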
Thanks for explaining your point; that the viability of inference scaling makes development differentially safer (all else equal) seems right.