Thanks for your question! I suspect there is some confusion going on here about what recursive reward modeling is. The example that you describe sounds like an example of imitating expert reasoning.
In recursive reward modeling, agent A1 is not decomposing tasks; it is trying to achieve some objective that the user intends it to perform. A2 then assists the human in evaluating A1’s behavior in order to train a reward model. Decomposition only happens in the evaluation of A1’s task.
For example, A1 proposes some plan x and A2 proposes the largest weakness y in the plan. The human then evaluates whether y is indeed a weakness in the plan x and how strong it is, and then judges the plan x based on this weakness. If you simplify and assume this judgement is binary (ϕ(x,y) is true iff the plan passes), then A1 “wins” iff ϕ(x,y) and A2 “wins” iff ¬ϕ(x,y). Thus the objective of the game becomes ∃x∀y.ϕ(x,y) for A1 and ¬∃x∀y.ϕ(x,y) for A2. Note that this formulation has similarities with debate. However, in practice judgements don’t need to be binary and there are a bunch of other differences (human closer in the loop, not limited to text, etc.).
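As a toy illustration of that binary game (this sketch is not from the original comment; the plan/weakness representations and all function names are made up), here is how the ∃x∀y.ϕ(x,y) structure could look in code, with A2's ∀y modeled as a search over candidate weaknesses:

```python
# Toy sketch of the binary judgement game described above.
# A1 proposes a plan x, A2 searches for the strongest weakness y,
# and the (stand-in) human judgement phi(x, y) decides who "wins".

def phi(plan: set, weakness: str) -> bool:
    # Stand-in for the human's binary judgement: the plan "passes"
    # iff the proposed weakness is not actually present in the plan.
    return weakness not in plan

def play_round(plan: set, candidate_weaknesses: list) -> bool:
    # A2 searches over y for a weakness that makes the plan fail, i.e. ¬phi(x, y).
    for y in candidate_weaknesses:
        if not phi(plan, y):
            return False  # A2 wins: it found a y with ¬phi(x, y)
    return True  # A1 wins: phi(x, y) held for every weakness A2 tried

if __name__ == "__main__":
    plan = {"no budget", "unclear timeline"}            # x, a plan with two real flaws
    weaknesses = ["no budget", "missing stakeholders"]  # A2's candidate critiques
    print("A1 wins round:", play_round(plan, weaknesses))  # False: "no budget" is a real flaw
```

In the real setup the human (assisted by A2) makes the judgement rather than a hard-coded predicate, and the judgement need not be binary; the sketch only shows the game structure.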
Thanks a lot! This definitely clears things up and also highlights the difference between recursive reward modeling and typical amplification/the expert imitation approach you mentioned.