Was anyone else unconvinced/confused (I was charitably confused, uncharitably unconvinced) by the analogy between recursive task/agent decomposition and first-order logic in section 3 under the heading “Analogy to Complexity Theory”? I suspect I’m missing something but I don’t see how recursive decomposition is analogous to **alternating** quantifiers?
It’s obvious that, at the first level, finding an x that satisfies ϕ(x) is similar to finding the right action, but I don’t see how the next level, ∃x∀y. ϕ(x,y), maps onto the decomposition: how is A2’s solving of one of A1’s decomposed tasks similar to universal quantification?
To take a very basic example, if I ask an agent to solve a simple problem like, “what is 1+2+3+4?” and the first agent decomposes it into “what is 1+2?”, “what is 3+4?”, and “what is the result of ‘1+2’ plus the result of ‘3+4’?” (this assumes we have some mechanism for pointing and specifying dependencies, like the one Ought is working on), what would this look like in the alternating quantifier formulation?
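(Concretely, I’m imagining the decomposition as something like the Python sketch below; `Task` and `depends_on` are placeholder names I’m making up, not anything from Ought or the paper.)

```python
# Placeholder sketch of the decomposition above: subtasks plus pointers
# saying which results feed into which task. All names are made up.
from dataclasses import dataclass, field

@dataclass
class Task:
    question: str
    depends_on: list = field(default_factory=list)  # tasks whose answers this one needs

# "what is 1+2+3+4?" as decomposed by the first agent:
t1 = Task("what is 1+2?")
t2 = Task("what is 3+4?")
t3 = Task("what is the result of t1 plus the result of t2?", depends_on=[t1, t2])

def solve(task: Task) -> int:
    """Solve a task by first solving its dependencies (each call stands in for a sub-agent)."""
    if not task.depends_on:
        # Leaf task: strip the wrapper text and evaluate the arithmetic directly.
        return eval(task.question.removeprefix("what is ").rstrip("?"))
    # Combination task: here the stand-in sub-agent just adds the dependency results.
    return sum(solve(dep) for dep in task.depends_on)

print(solve(t3))  # 10
```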
Thanks for your question! I suspect there is some confusion going on here about what recursive reward modeling is. The example you describe sounds like it comes from imitating expert reasoning.
In recursive reward modeling, agent A1 is not decomposing tasks; it is trying to achieve some objective that the user intends for it to perform. A2 then assists the human in evaluating A1’s behavior in order to train a reward model. Decomposition only happens in the evaluation of A1’s task.
For example, A1 proposes some plan x and A2 proposes the largest weakness y in the plan. The human then evaluates whether y is indeed a weakness in the plan x and how strong it is, and then judges the plan x based on this weakness. If you simplify and assume this judgement is binary (ϕ(x,y) is true iff the plan passes), then A1 “wins” iff ϕ(x,y) and A2 “wins” iff ¬ϕ(x,y). Thus the objective of the game becomes ∃x∀y. ϕ(x,y) for A1 and ¬∃x∀y. ϕ(x,y) for A2. Note that this formulation has similarities with debate. However, in practice judgements don’t need to be binary and there are a bunch of other differences (human closer in the loop, not limited to text, etc.).
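To make the structure of that game concrete, here is a minimal Python sketch of the binary version; `propose_plan`, `propose_weakness`, and `human_judges` are placeholder names for the three moves, not anything from the paper:

```python
# Minimal sketch of the binary version of the evaluation game described above.
# All function names are placeholders.

def evaluation_game(propose_plan, propose_weakness, human_judges):
    """One round: A1 proposes a plan x, A2 proposes the largest weakness y it
    can find, and the human judges whether x survives y (i.e. whether phi(x, y)
    holds). Returns +1 if A1 wins and -1 if A2 wins; since the game is zero-sum,
    A2's reward is the negation. This judgement is the signal used to train the
    reward model for A1."""
    x = propose_plan()                  # A1's move
    y = propose_weakness(x)             # A2's move
    plan_passes = human_judges(x, y)    # phi(x, y)
    return 1 if plan_passes else -1

# Toy usage: a "plan" is a number, a claimed "weakness" is a divisor, and the
# plan passes iff the claimed divisor does not actually divide it.
print(evaluation_game(
    propose_plan=lambda: 7,
    propose_weakness=lambda x: 2,
    human_judges=lambda x, y: x % y != 0,
))  # 1, because 2 is not a genuine weakness of 7
```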
Thanks a lot! This definitely clears things up and also highlights the difference between recursive reward modeling and typical amplification/the expert imitation approach you mentioned.
Finding the action a that optimizes a reward function r(a) is NP-complete for general r. If the reward function r is itself able to use an oracle for NP, then that’s complete for NP^NP, and so on. The analogy is loose because you aren’t really getting the optimal a at each step.
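Here is a toy sketch of that recursion, with brute-force search over a tiny action set standing in for the NP oracle at each level (everything here is made up for illustration, not taken from the paper):

```python
# Toy sketch of the recursion in the complexity analogy. Brute-force search
# over a tiny action set stands in for the NP oracle at each level.

ACTIONS = range(4)

def argmax(reward):
    """One level of optimization: exhaustively pick the best action under `reward`."""
    return max(ACTIONS, key=reward)

def level_reward(level):
    """Build a reward function at the given level of the hierarchy.

    A level-0 reward is an ordinary function of the action. A level-k reward
    calls a level-(k-1) optimizer while being evaluated, so optimizing it sits
    one step up the hierarchy (NP, NP^NP, and so on)."""
    if level == 0:
        return lambda a: -(a - 2) ** 2  # arbitrary base reward
    inner = level_reward(level - 1)
    # Evaluating this reward requires solving the inner optimization (the "oracle call"):
    # the action a is scored against the inner optimizer's choice.
    return lambda a: -abs(a - argmax(inner))

for k in range(3):
    print(k, argmax(level_reward(k)))  # the optimal action at each level (here always 2)
```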