Was anyone else unconvinced/confused (I was charitably confused, uncharitably unconvinced) by the analogy between recursive task/agent decomposition and first-order logic in section 3 under the heading “Analogy to Complexity Theory”? I suspect I’m missing something but I don’t see how recursive decomposition is analogous to **alternating** quantifiers?
It’s obvious that, at the first level, finding an x that satisfies ϕ(x) is similar to finding the right action, but I don’t see how the next level, ∃x∀y. ϕ(x,y), maps onto the decomposition: how is A2’s solving of one of A1’s decomposed tasks similar to universal quantification?
To take a very basic example, if I ask an agent to solve a simple problem like, “what is 1+2+3+4?” and the first agent decomposes it into “what is 1+2?”, “what is 3+4?”, and “what is the result of ‘1+2’ plus the result of ‘3+4’?” (this assumes we have some mechanism for pointing and specifying dependencies, like the one Ought is working on), what would this look like in the alternating quantifier formulation?
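(Concretely, I’m imagining the decomposition as something like the Python sketch below; `Task` and `depends_on` are placeholder names I’m making up, not anything from Ought or the paper.)

```python
# Placeholder sketch of the decomposition above: subtasks plus pointers
# saying which results feed into which task. All names are made up.
from dataclasses import dataclass, field

@dataclass
class Task:
    question: str
    depends_on: list = field(default_factory=list)  # tasks whose answers this one needs

# "what is 1+2+3+4?" as decomposed by the first agent:
t1 = Task("what is 1+2?")
t2 = Task("what is 3+4?")
t3 = Task("what is the result of t1 plus the result of t2?", depends_on=[t1, t2])

def solve(task: Task) -> int:
    """Solve a task by first solving its dependencies (each call stands in for a sub-agent)."""
    if not task.depends_on:
        # Leaf task: strip the wrapper text and evaluate the arithmetic directly.
        return eval(task.question.removeprefix("what is ").rstrip("?"))
    # Combination task: here the stand-in sub-agent just adds the dependency results.
    return sum(solve(dep) for dep in task.depends_on)

print(solve(t3))  # 10
```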
Thanks for your question! I suspect there is some confusion going on here about what recursive reward modeling is. The example you describe sounds like it comes from imitating expert reasoning.
In recursive reward modeling, agent A1 is not decomposing tasks; it is trying to achieve some objective that the user intends for it to perform. A2 then assists the human in evaluating A1’s behavior in order to train a reward model. Decomposition only happens in the evaluation of A1’s task.
For example, A1 proposes some plan x and A2 proposes the largest weakness y in the plan. The human then evaluates whether y is indeed a weakness in the plan x and how strong it is, and then judges the plan x based on this weakness. If you simplify and assume this judgement is binary (ϕ(x,y) is true iff the plan passes), then A1 “wins” iff ϕ(x,y) and A2 “wins” iff ¬ϕ(x,y). Thus the objective of the game becomes ∃x∀y. ϕ(x,y) for A1 and ¬∃x∀y. ϕ(x,y) for A2. Note that this formulation has similarities with debate. However, in practice judgements don’t need to be binary and there are a bunch of other differences (human closer in the loop, not limited to text, etc.).
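To make the structure of that game concrete, here is a minimal Python sketch of the binary version; `propose_plan`, `propose_weakness`, and `human_judges` are placeholder names for the three moves, not anything from the paper:

```python
# Minimal sketch of the binary version of the evaluation game described above.
# All function names are placeholders.

def evaluation_game(propose_plan, propose_weakness, human_judges):
    """One round: A1 proposes a plan x, A2 proposes the largest weakness y it
    can find, and the human judges whether x survives y (i.e. whether phi(x, y)
    holds). Returns +1 if A1 wins and -1 if A2 wins; since the game is zero-sum,
    A2's reward is the negation. This judgement is the signal used to train the
    reward model for A1."""
    x = propose_plan()                  # A1's move
    y = propose_weakness(x)             # A2's move
    plan_passes = human_judges(x, y)    # phi(x, y)
    return 1 if plan_passes else -1

# Toy usage: a "plan" is a number, a claimed "weakness" is a divisor, and the
# plan passes iff the claimed divisor does not actually divide it.
print(evaluation_game(
    propose_plan=lambda: 7,
    propose_weakness=lambda x: 2,
    human_judges=lambda x, y: x % y != 0,
))  # 1, because 2 is not a genuine weakness of 7
```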
Thanks a lot! This definitely clears things up and also highlights the difference between recursive reward modeling and typical amplification/the expert imitation approach you mentioned.
Finding the action a that optimizes a reward function r(a) is NP-complete for general r. If the reward function r is itself able to use an oracle for NP, then that’s complete for NP^NP, and so on. The analogy is loose because you aren’t really getting the optimal a at each step.
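Here is a toy sketch of that recursion, with brute-force search over a tiny action set standing in for the NP oracle at each level (everything here is made up for illustration, not taken from the paper):

```python
# Toy sketch of the recursion in the complexity analogy. Brute-force search
# over a tiny action set stands in for the NP oracle at each level.

ACTIONS = range(4)

def argmax(reward):
    """One level of optimization: exhaustively pick the best action under `reward`."""
    return max(ACTIONS, key=reward)

def level_reward(level):
    """Build a reward function at the given level of the hierarchy.

    A level-0 reward is an ordinary function of the action. A level-k reward
    calls a level-(k-1) optimizer while being evaluated, so optimizing it sits
    one step up the hierarchy (NP, NP^NP, and so on)."""
    if level == 0:
        return lambda a: -(a - 2) ** 2  # arbitrary base reward
    inner = level_reward(level - 1)
    # Evaluating this reward requires solving the inner optimization (the "oracle call"):
    # the action a is scored against the inner optimizer's choice.
    return lambda a: -abs(a - argmax(inner))

for k in range(3):
    print(k, argmax(level_reward(k)))  # the optimal action at each level (here always 2)
```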