I’m a little confused. What exactly is the function of the independent review, in your proposal? Are you imagining that the independent alignment reviewer provides some sort of “danger” score which is added to the loss? Or is the independent review used for some purpose other than providing a gradient signal?
Good question. I should try to explain this more clearly and succinctly. One planned post will try to do that.
In the meantime, let me briefly try to clarify here:
The internal review is applied to decision-making. If the review determines that an action might have negative impacts beyond an internal threshold, the agent won't take that action. At minimum it will ask for human review; or it may be built so that the user can't override its internal review. There are many formulas and techniques one can imagine for weighing positive and negative predicted outcomes and selecting an action.
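To make that concrete, here's a minimal sketch of what such a review gate could look like. All names here (`REVIEW_THRESHOLD`, `review_action`, `decide`) are hypothetical illustrations, not from any real system; the actual weighing formula could be far more sophisticated.

```python
# Hypothetical internal-review gate: weigh predicted outcomes, then
# execute, escalate to a human, or refuse. Names are illustrative only.

REVIEW_THRESHOLD = -0.5  # net-harm score below which the agent won't act unilaterally

def review_action(predicted_outcomes):
    """Collapse (probability, value) pairs into one expected-value score.

    Negative values represent predicted harms, positive values benefits.
    """
    return sum(prob * value for prob, value in predicted_outcomes)

def decide(predicted_outcomes, allow_user_override=False):
    """Gate an action on the internal review's net score."""
    score = review_action(predicted_outcomes)
    if score >= REVIEW_THRESHOLD:
        return "execute"
    # Below threshold: escalate rather than act.
    if allow_user_override:
        return "ask_human_review"
    return "refuse"  # built so the user can't override the internal review
```

The key design point is that the review sits between planning and execution, rather than feeding into any loss function.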
There’s no relevant loss function. Language model agents aren’t doing continuous training. They don’t even periodically update the weights of their central LLM/foundation model. I think future versions will learn in a different way: by writing text files about particular experiences, skills, and knowledge.
At some point developers might well introduce network training, either in the core LLM or in a “control network” that governs “executive function”, like the outer loop of algorithmic code I described. I hope that type of learning isn’t used, because introducing RL training in-line reintroduces all of the problems of optimizing a goal that you haven’t carefully defined.
I share your hope, but I’m pessimistic. Using RL to continuously train the outer loop of an LLM agent seems like a no-brainer from a capabilities standpoint.
The alternative would be to pretrain the outer loop, and freeze the weights upon deployment. Then, I guess your plan would be to only use the independent reviewer after deployment, so that the reviewer’s decision never influences the outer-loop weights. Correct me if I’m wrong here.
I’m glad you plan to address this in a future post, and I look forward to reading it.
We can now see some progress with o1 and similar models. They are doing some RL training of the “outer loop” (to the limited extent they have one), but r1 and QwQ still produce very legible CoTs.
So far.
See also my clarification on how an opaque CoT would still allow some internal review, but probably not an independent one, in this other comment.
See also Daniel Kokotajlo’s recent work on a “Shoggoth/Face” system that maintains legibility, and his other thinking on this topic. Maintaining legibility seems quite possible, but it does bear an alignment tax. This could be as low as a small fraction if the CoT largely works well when it’s condensed to language. I think it will; language is made for condensing complex concepts in order to clarify and communicate thinking (including communicating it to future selves to carry on with).
It won’t be perfect, so there will be an alignment tax to be paid. But understanding what your model is thinking is very useful for developing further capabilities as well as for safety, so I think people may actually implement it if the tax turns out to be modest, maybe something like 50% greater compute during training and similar during inference.