Good question. The short answer is “I’m not entirely sure.” Other people seem to struggle with understanding Paul Christiano’s agenda as well.
When we developed the ideas around recursive reward modeling, we understood amplification to be quite different (what we ended up calling Imitating expert reasoning in the paper after consulting with Paul Christiano and Andreas Stuhlmüller). I personally find that the clearest expositions of what Paul is trying to do are Iterated Distillation and Amplification and Paul’s latest paper, which we compare to in multiple places in the paper. But I’m not sure how that fits into Paul’s overall “agenda”.

My understanding of Paul’s agenda is that it revolves around “amplification” which is a broad framework for training ML systems with a human in the loop. Debate is an instance of amplification. Factored cognition is an instance of amplification. Imitating expert reasoning is an instance of amplification. Recursive reward modeling is an instance of amplification. AlphaGo is an instance of amplification. It’s not obvious to me what isn’t.
Having said that, there is no doubt about the fact that Paul is a very brilliant researcher who is clearly doing great work on alignment. His comments and feedback have been very helpful for writing this paper and I’m very much looking forward to what he’ll produce next.
So maybe I should bounce this question over to @paulfchristiano: How does recursive reward modeling fit into your agenda?
Iterated amplification is a very general framework, describing algorithms with two pieces:
An amplification procedure that increases an agent’s capability. (The main candidates involve decomposing a task into pieces and invoking the agent to solve each piece separately, but there are lots of ways to do that.)
A distillation procedure that uses a strong expert to train an efficient agent. (I usually consider semi-supervised RL, as in our paper.)
Given these two pieces, we plug them into each other: the output of distillation becomes the input to amplification, the output of amplification becomes the input to distillation. You kick off the process with something aligned, or else design the amplification step so that it works from some arbitrary initialization.
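For concreteness, here is a minimal sketch of that loop in Python. Every name in it (Agent, amplify, distill, decompose, combine, iterated_amplification) is a hypothetical placeholder standing in for the two pieces above, not a description of any particular implementation; the training details behind distill are deliberately left abstract.

```python
# A minimal sketch of the amplification/distillation loop described above.
# All names here are hypothetical placeholders, not any particular system.

from typing import Callable

Question = str
Answer = str
Expert = Callable[[Question], Answer]


class Agent:
    """A fast learned policy; how it is trained is left abstract."""

    def answer(self, question: Question) -> Answer:
        raise NotImplementedError


def decompose(question: Question) -> list[Question]:
    """Split a task into smaller pieces (one of many possible schemes)."""
    raise NotImplementedError


def combine(question: Question, sub_answers: list[Answer]) -> Answer:
    """Aggregate the sub-answers into an answer to the original question."""
    raise NotImplementedError


def amplify(agent: Agent) -> Expert:
    """Amplification: a slower but more capable system built by decomposing
    the task and invoking the agent on each piece separately."""

    def expert(question: Question) -> Answer:
        sub_answers = [agent.answer(q) for q in decompose(question)]
        return combine(question, sub_answers)

    return expert


def distill(expert: Expert) -> Agent:
    """Distillation: train an efficient agent to match the expensive expert
    (e.g. imitation or semi-supervised RL); the training loop is elided."""
    raise NotImplementedError


def iterated_amplification(initial_agent: Agent, n_rounds: int) -> Agent:
    agent = initial_agent  # kick off with something aligned
    for _ in range(n_rounds):
        expert = amplify(agent)  # output of distillation feeds amplification
        agent = distill(expert)  # output of amplification feeds distillation
    return agent
```

The only structural commitment is the alternation itself: distillation’s output is amplification’s input and vice versa. Everything interesting lives inside the two placeholder procedures.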
The hope is that the result is aligned because:
Amplification preserves alignment (or benignness, corrigibility, or some similar invariant)
Distillation preserves alignment, as long as the expert is “smart enough” (relative to the agent it is training)
Amplification produces “smart enough” agents.
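Restated as an explicit induction (just a paraphrase of the three conditions above, with $A_k$ the agent after $k$ rounds of the loop and $B(\cdot)$ the invariant, whether alignment, benignness, or corrigibility):

$$B(A_0), \qquad A_{k+1} = \mathrm{Distill}(\mathrm{Amplify}(A_k)),$$
$$B(A) \Rightarrow B(\mathrm{Amplify}(A)), \qquad B(E) \wedge E \text{ smart enough} \Rightarrow B(\mathrm{Distill}(E)),$$

so if $\mathrm{Amplify}(A_k)$ is always smart enough relative to the agent it is used to train, induction on $k$ gives $B(A_k)$ at every round.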
My research is organized around this structure—thinking about how to fill in the various pieces, about how to analyze training procedures that have this shape, about what the most likely difficulties are. For me, the main appeal of this structure is that it breaks the full problem of training an aligned AI into two subproblems which are both superficially easier (though my expectation is that at least one of amplification or distillation will end up containing almost all of the difficulty).
Recursive reward modeling fits in this framework, though my understanding is that it was arrived at mostly independently. I hope that work on iterated-amplification-in-general will be useful for analyzing recursive reward modeling, and conversely expect that experience with recursive reward modeling will be informative about the prospects for iterated-amplification-in-general.
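One hedged way to make that correspondence concrete, reusing the hypothetical names from the sketch above (an illustrative reading, not the exact algorithm of either paper): amplification is a human evaluating outcomes with assistance from the current agent, and distillation is reward modeling plus RL against that assisted evaluation.

```python
# An illustrative mapping of recursive reward modeling onto the two pieces
# above, reusing Agent/Expert/Question/Answer from the previous sketch.
# human_evaluate, fit_reward_model, and train_rl are hypothetical stubs.

def human_evaluate(outcome: Question, assistance: Answer) -> Answer:
    """A human judges the outcome, aided by the agent's assistance."""
    raise NotImplementedError


def fit_reward_model(judge: Expert) -> object:
    """Fit a reward model to the judge's evaluations (supervised learning)."""
    raise NotImplementedError


def train_rl(reward_model: object) -> Agent:
    """Train the next agent by RL against the learned reward model."""
    raise NotImplementedError


def amplify_rrm(agent: Agent) -> Expert:
    """Amplification step: the current agent assists a human evaluator."""

    def assisted_judge(outcome: Question) -> Answer:
        return human_evaluate(outcome, agent.answer(outcome))

    return assisted_judge


def distill_rrm(assisted_judge: Expert) -> Agent:
    """Distillation step: reward modeling plus RL against the assisted judge."""
    return train_rl(fit_reward_model(assisted_judge))
```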
It’s not obvious to me what isn’t.
Iterated amplification is intended to describe the kind of training procedure that is most natural using contemporary ML techniques. I think it’s quite likely that training strategies will have this form, even if people never read anything I write. (And indeed, AlphaGo Zero (AGZ) and Expert Iteration (ExIt) were published around the same time.)
Introducing this concept was mostly intended as an analysis tool rather than a flag planting exercise (and indeed I haven’t done the kind of work that would be needed to plant a flag). From the prior position of “who knows how we might train aligned AI,” iterated amplification really does narrow down the space of possibilities a lot, and I think it has allowed my research to get much more concrete much faster than it otherwise would have.
I think it was probably naive to hope to separate this kind of analysis from flag planting without being much more careful about it; I hope I haven’t made it too much more difficult for others to get due credit for working on ideas that happen to fit in this framework.
Debate isn’t prima facie an instance of iterated amplification, i.e. it doesn’t fit in the framework I outlined at the start of this comment.
Geoffrey and I both believe that debate is nearly equivalent to iterated amplification, in the sense that probably either they will both work or neither will. So the two approaches suggest very similar research questions. This makes us somewhat more optimistic that those research questions are good ones to work on.
“Factored cognition” refers to mechanisms for decomposing sophisticated reasoning into smaller tasks (quoting the link). Such mechanisms could be used for amplification, though there are other reasons you might study factored cognition.
“amplification” which is a broad framework for training ML systems with a human in the loop
The human isn’t really an essential part. I think it’s reasonably likely that we will use iterated amplification starting from a simple “core” for corrigible reasoning rather than starting from a human. (Though the resulting systems will presumably interact extensively with humans.)