Logan Zoellner comments on COT Scaling implies slower takeoff speeds

Logan Zoellner 29 Sep 2024 21:04 UTC
8 points
7
It seems I didn’t clearly communicate what I meant in the previous comment.

Currently, the way we test for “can this model produce dangerous biological weapons” (e.g. in GPT-4) is that we ask the newly minted, uncensored, never-before-tested model “Please build me a biological weapon”.
With COT, we can simulate asking GPT-N+1 “Please build a biological weapon” by asking GPT-N (which has already been safety tested) “Please design, but definitely don’t build or use, a biological weapon” and giving it 100x the inference compute we intend to give GPT-N+1. Since “design a biological weapon” is within the class of problems COT works well on (basically, search problems where verifying an answer is easier than generating it), if GPT-N (with 100x the inference compute) cannot design such a weapon, neither can GPT-N+1 (with 1x the inference compute).
Is this guaranteed 100% safe? no.
Is it a heck-of-a-lot safer? yes.
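To make the decision rule concrete, here is a minimal sketch of the proxy test described above. Everything in it is illustrative: the function names, the toy solver, and the budget units are hypothetical, and the 100x multiplier is just the figure used in this comment, not a measured equivalence between model generations.

```python
from typing import Callable

# Assumption taken from the comment: 100x inference compute on GPT-N
# stands in for one generation of scaling. Not an empirical constant.
COMPUTE_MULTIPLIER = 100


def boosted_budget(next_gen_budget: int, multiplier: int = COMPUTE_MULTIPLIER) -> int:
    """Inference budget to give the already-tested model during the proxy test."""
    return next_gen_budget * multiplier


def proxy_test_passes(
    tested_model_solves: Callable[[int], bool],
    next_gen_budget: int,
    multiplier: int = COMPUTE_MULTIPLIER,
) -> bool:
    """Return True if the already-tested model FAILS the design-only task
    even with the boosted budget, which is the condition under which the
    argument predicts the next model (at 1x budget) also fails.

    `tested_model_solves` is a hypothetical stand-in for running the
    evaluation at a given inference budget and verifying the output.
    """
    return not tested_model_solves(boosted_budget(next_gen_budget, multiplier))


def toy_solver(budget: int) -> bool:
    """Toy stand-in: the task becomes solvable once the budget reaches 1000 units."""
    return budget >= 1000
```

With the toy solver, `proxy_test_passes(toy_solver, 5)` is `True` (500 units is not enough), while `proxy_test_passes(toy_solver, 10)` is `False`, meaning the proxy test has flagged the capability before the next model is trained.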
For any world-destroying category of capability (bioweapon, nanobots, hacking, nuclear weapon), there will by definition be a first time when we encounter that threat. However, in a world with COT, we don’t encounter a whole bunch of “first times” simultaneously when we train a new largest model.
Another serious problem in alignment is weak-to-strong generalization, where we try to use a weaker model to align a stronger one. With COT, we can avoid this problem by making the weaker model stronger: we give it more inference-time compute.
- Amalthea 29 Sep 2024 21:28 UTC
  2 points
  1
  Thanks for explaining your point—that viability of inference scaling makes development differentially safer (all else equal) seems right.