I share your hope, but I’m pessimistic. Using RL to continuously train the outer loop of an LLM agent seems like a no-brainer from a capabilities standpoint.
The alternative would be to pretrain the outer loop, and freeze the weights upon deployment. Then, I guess your plan would be to only use the independent reviewer after deployment, so that the reviewer’s decision never influences the outer-loop weights. Correct me if I’m wrong here.
I’m glad you plan to address this in a future post, and I look forward to reading it.
We can now see some progress with o1 and similar models. They do some RL training of the “outer loop” (to the limited extent they have one), but R1 and QwQ still produce very legible CoTs.
So far.
See also my clarification on how an opaque CoT would still allow some internal review, but probably not an independent one, in this other comment.
See also Daniel Kokotajlo’s recent work on a “Shoggoth/Face” system that maintains legibility, and his other thinking on this topic. Maintaining legibility seems quite possible, but it does bear an alignment tax. That tax could be as low as a small fraction if the CoT largely works well when it’s condensed to language. I think it will; language is made for condensing complex concepts in order to clarify and communicate thinking (including communicating it to one’s future selves to carry on with).
It won’t be perfect, so there will be an alignment tax to be paid. But understanding what your model is thinking is very useful for developing further capabilities as well as for safety, so I think people may actually implement it if the tax turns out to be modest, perhaps something like 50% more compute during training and a similar amount during inference.