I share your hope, but I’m pessimistic. Using RL to continuously train the outer loop of an LLM agent seems like a no-brainer from a capabilities standpoint.
The alternative would be to pretrain the outer loop, and freeze the weights upon deployment. Then, I guess your plan would be to only use the independent reviewer after deployment, so that the reviewer’s decision never influences the outer-loop weights. Correct me if I’m wrong here.
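To make sure I'm picturing the same setup, here is a minimal sketch of what I mean (the class, names, and toy update rule are all mine and purely illustrative, not from your post): the outer loop receives RL updates only before deployment, its weights are frozen at deployment, and afterwards the reviewer only gates actions and never produces a training signal.

```python
from dataclasses import dataclass
import random


@dataclass
class OuterLoopPolicy:
    """Hypothetical stand-in for the agent's outer loop."""
    weights: float = 0.0
    frozen: bool = False

    def act(self, observation: float) -> float:
        return self.weights * observation

    def rl_update(self, reward: float) -> None:
        if self.frozen:
            raise RuntimeError("outer-loop weights are frozen after deployment")
        self.weights += 0.01 * reward  # placeholder for a real RL update rule


def pretrain(policy: OuterLoopPolicy, steps: int) -> None:
    """Pre-deployment: RL feedback shapes the outer-loop weights."""
    for _ in range(steps):
        reward = random.random()  # placeholder environment feedback
        policy.rl_update(reward)
    policy.frozen = True  # freeze the weights upon deployment


def deployed_step(policy: OuterLoopPolicy, reviewer_approves) -> float | None:
    """Post-deployment: the reviewer gates each action, but its verdict is never
    converted into a reward, so it cannot influence the frozen weights."""
    action = policy.act(observation=1.0)
    return action if reviewer_approves(action) else None
```

If that matches your intent, the crux for me is the first step: whether labs will accept freezing the outer loop at all, given the capability gains from continuous RL.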
I’m glad you plan to address this in a future post, and I look forward to reading it.