Thanks for your question! I suspect there is some confusion going on here about what recursive reward modeling is. The example that you describe sounds like an example of imitating expert reasoning.
In recursive reward modeling, agent A1 is not decomposing tasks; it is trying to achieve some objective that the user intends it to perform. A2 then assists the human in evaluating A1’s behavior in order to train a reward model. Decomposition only happens in the evaluation of A1’s task.
For example, A1 proposes some plan x and A2 proposes the largest weakness y in the plan. The human then evaluates whether y is indeed a weakness in the plan x and how strong it is, and then judges the plan x based on this weakness. If you simplify and assume this judgement is binary (ϕ(x,y) is true iff the plan passes), then A1 “wins” iff ϕ(x,y) and A2 “wins” iff ¬ϕ(x,y). Thus the objective of the game becomes ∃x∀y.ϕ(x,y) for A1 and ¬∃x∀y.ϕ(x,y) for A2. Note that this formulation has similarities with debate. However, in practice judgements don’t need to be binary and there are a bunch of other differences (human closer in the loop, not limited to text, etc.).
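As a toy illustration of that binary game (this sketch is not from the original comment; the plan/weakness representations and all function names are made up), here is how the ∃x∀y.ϕ(x,y) structure could look in code, with A2's ∀y modeled as a search over candidate weaknesses:

```python
# Toy sketch of the binary judgement game described above.
# A1 proposes a plan x, A2 searches for the strongest weakness y,
# and the (stand-in) human judgement phi(x, y) decides who "wins".

def phi(plan: set, weakness: str) -> bool:
    # Stand-in for the human's binary judgement: the plan "passes"
    # iff the proposed weakness is not actually present in the plan.
    return weakness not in plan

def play_round(plan: set, candidate_weaknesses: list) -> bool:
    # A2 searches over y for a weakness that makes the plan fail, i.e. ¬phi(x, y).
    for y in candidate_weaknesses:
        if not phi(plan, y):
            return False  # A2 wins: it found a y with ¬phi(x, y)
    return True  # A1 wins: phi(x, y) held for every weakness A2 tried

if __name__ == "__main__":
    plan = {"no budget", "unclear timeline"}            # x, a plan with two real flaws
    weaknesses = ["no budget", "missing stakeholders"]  # A2's candidate critiques
    print("A1 wins round:", play_round(plan, weaknesses))  # False: "no budget" is a real flaw
```

In the real setup the human (assisted by A2) makes the judgement rather than a hard-coded predicate, and the judgement need not be binary; the sketch only shows the game structure.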
Thanks a lot! This definitely clears things up and also highlights the difference between recursive reward modeling and typical amplification/the expert imitation approach you mentioned.