I’m a little confused. What exactly is the function of the independent review, in your proposal? Are you imagining that the independent alignment reviewer provides some sort of “danger” score which is added to the loss? Or is the independent review used for some purpose other than providing a gradient signal?
Good question. I should try to explain this more clearly and succinctly. One planned post will try to do that.
In the meantime, let me briefly try to clarify here:
The internal review is applied to decision-making. If the review determines that a candidate action might have negative impacts past an internal threshold, the agent won’t take that action. At the least it will ask for human review; or it may be built so that the user can’t override its internal review. There are many formulas and techniques one could imagine for weighing positive and negative predicted outcomes and picking an action.
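To make that concrete, here’s a minimal sketch of one such formula. All names and thresholds are my own illustrative assumptions, not a real system: a harm score per candidate action, with one threshold for refusing outright and a lower one for escalating to a human.

```python
# Hypothetical pre-action internal review. Assumes harm scores are
# normalized to [0, 1]; the thresholds are arbitrary placeholders.
HARM_THRESHOLD = 0.3       # at or above this, the agent refuses outright
ESCALATE_THRESHOLD = 0.1   # at or above this, it asks for human review

def review_action(action, predict_harm):
    """Return 'execute', 'ask_human', or 'refuse' for a candidate action."""
    harm = predict_harm(action)  # e.g., a separate reviewer model's score
    if harm >= HARM_THRESHOLD:
        return "refuse"          # the user cannot override this branch
    if harm >= ESCALATE_THRESHOLD:
        return "ask_human"
    return "execute"

# Toy usage with a stand-in harm predictor (a lookup table):
scores = {"send_email": 0.05, "delete_files": 0.6, "post_publicly": 0.2}
decisions = {a: review_action(a, scores.get) for a in scores}
```

The point is just that the review gates actions at decision time; nothing here produces a gradient signal.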
There’s no relevant loss function: language model agents aren’t doing continuous training. They don’t even periodically update the weights of their central LLM/foundation model. I think future versions will learn in a different way, by writing text files about particular experiences, skills, and knowledge.
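A toy sketch of that kind of learning-by-notes, under my own assumptions about file layout (one plain-text file per topic, appended to and read back; no weight updates anywhere):

```python
# Illustrative episodic-memory-as-text-files sketch. Structure and
# naming are assumptions for illustration, not any real agent's design.
import os
import tempfile

def write_lesson(memory_dir, topic, lesson):
    """Append a learned lesson to the topic's text file."""
    path = os.path.join(memory_dir, f"{topic}.txt")
    with open(path, "a") as f:
        f.write(lesson + "\n")

def recall(memory_dir, topic):
    """Read back all lessons recorded for a topic."""
    path = os.path.join(memory_dir, f"{topic}.txt")
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return [line.strip() for line in f]

memory = tempfile.mkdtemp()
write_lesson(memory, "email_tool", "Attachments over 25MB fail silently.")
write_lesson(memory, "email_tool", "Confirm the recipient before sending.")
notes = recall(memory, "email_tool")
```

The retrieved notes would be injected into the agent’s prompt, so “learning” happens in context rather than in the weights.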
At some point developers might well introduce network training, either in the core LLM or in a “control network” that controls “executive function”, like the outer loop of algorithmic code I described. I hope that type of learning isn’t used, because introducing in-line RL training re-introduces all of the problems of optimizing a goal that you haven’t carefully defined.
I share your hope, but I’m pessimistic. Using RL to continuously train the outer loop of an LLM agent seems like a no-brainer from a capabilities standpoint.
The alternative would be to pretrain the outer loop and freeze its weights upon deployment. Then, I guess, your plan would be to use the independent reviewer only after deployment, so that the reviewer’s decisions never influence the outer-loop weights. Correct me if I’m wrong here.
I’m glad you plan to address this in a future post, and I look forward to reading it.