Wei Dai comments on New safety research agenda: scalable agent alignment via reward modeling

Wei Dai 20 Nov 2018 18:37 UTC
LW: 29 AF: 13
Am I correct in thinking that this is a subset of Paul’s agenda, since he has talked about something very similar (including recursive scaling and reward modeling) as a possible instantiation of his approach? Do you see any substantial differences between what you’re proposing and what he proposed in that post?

(If it is, I guess I’m simultaneously happy that more resources will be devoted to Paul’s agenda, which definitely could use them, and disappointed by the implication that a more promising (or even just another uncorrelated) ML-based approach to alignment probably isn’t forthcoming from DeepMind.)
- janleike 31 Dec 2018 23:48 UTC
  LW: 24 AF: 10
  Good question. The short answer is “I’m not entirely sure.” Other people seem to struggle with understanding Paul Christiano’s agenda as well.
  When we developed the ideas around recursive reward modeling, we understood amplification to be quite different (what we ended up calling Imitating expert reasoning in the paper after consulting with Paul Christiano and Andreas Stuhlmüller). I personally find that the clearest expositions for what Paul is trying to do are Iterated Distillation and Amplification and Paul’s latest paper, which we compare to in multiple places in the paper. But I’m not sure how that fits into Paul’s overall “agenda”.
  My understanding of Paul’s agenda is that it revolves around “amplification” which is a broad framework for training ML systems with a human in the loop. Debate is an instance of amplification. Factored cognition is an instance of amplification. Imitating expert reasoning is an instance of amplification. Recursive reward modeling is an instance of amplification. AlphaGo is an instance of amplification. It’s not obvious to me what isn’t.
  Having said that, there is no doubt about the fact that Paul is a very brilliant researcher who is clearly doing great work on alignment. His comments and feedback have been very helpful for writing this paper and I’m very much looking forward to what he’ll produce next.
  So maybe I should bounce this question over to @paulfchristiano: How does recursive reward modeling fit into your agenda?
  - paulfchristiano 2 Jan 2019 6:22 UTC
    LW: 44 AF: 14
    Iterated amplification is a very general framework, describing algorithms with two pieces:
    An amplification procedure that increases an agent’s capability. (The main candidates involve decomposing a task into pieces and invoking the agent to solve each piece separately, but there are lots of ways to do that.)
    A distillation procedure that uses a strong expert to train an efficient agent. (I usually consider semi-supervised RL, as in our paper.)
    Given these two pieces, we plug them into each other: the output of distillation becomes the input to amplification, the output of amplification becomes the input to distillation. You kick off the process with something aligned, or else design the amplification step so that it works from some arbitrary initialization.
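    To make the shape of this loop concrete, here is a toy Python sketch (an illustration only, not code from the paper). The task (summing a tuple of bits), the names `base_agent`, `amplify`, `distill`, and `iterate`, and the lookup-table stand-in for training are all assumptions made for the example. It models capability growth only and says nothing about whether alignment is preserved.

```python
from itertools import product
from typing import Callable, Dict, Tuple

Task = Tuple[int, ...]           # toy task: sum a tuple of bits
Agent = Callable[[Task], int]    # an agent maps a task to an answer


def base_agent(task: Task) -> int:
    """Hypothetical trusted starting agent: only handles tasks of length <= 1."""
    if len(task) > 1:
        raise ValueError("task too large for the base agent")
    return task[0] if task else 0


def amplify(agent: Agent) -> Agent:
    """Amplification: split the task in two, call the agent on each piece, combine."""
    def amplified(task: Task) -> int:
        mid = len(task) // 2
        return agent(task[:mid]) + agent(task[mid:])
    return amplified


def distill(expert: Agent, max_len: int) -> Agent:
    """Distillation stand-in: tabulate the slow expert's answers so the new
    agent answers with one lookup. A real system would train a model instead."""
    table: Dict[Task, int] = {
        task: expert(task)
        for n in range(max_len + 1)
        for task in product((0, 1), repeat=n)
    }

    def distilled(task: Task) -> int:
        if task not in table:
            raise ValueError("task outside the distilled agent's range")
        return table[task]
    return distilled


def iterate(rounds: int) -> Agent:
    """Plug the two pieces into each other: the output of distillation feeds
    amplification, and the output of amplification feeds distillation."""
    agent: Agent = base_agent
    capacity = 1  # longest task the current agent can handle
    for _ in range(rounds):
        capacity *= 2  # one split doubles the task length the expert can handle
        agent = distill(amplify(agent), capacity)
    return agent
```

    In this sketch, `iterate(3)` returns an agent that answers bit-tuples of length up to 8 with a single table lookup, although the starting agent could only handle length 1.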
    The hope is that the result is aligned because:
    Amplification preserves alignment (or benignness, corrigibility, or some similar invariant).
    Distillation preserves alignment, as long as the expert is “smart enough” (relative to the agent they are training).
    Amplification produces “smart enough” agents.
    My research is organized around this structure—thinking about how to fill in the various pieces, about how to analyze training procedures that have this shape, about what the most likely difficulties are. For me, the main appeal of this structure is that it breaks the full problem of training an aligned AI into two subproblems which are both superficially easier (though my expectation is that at least one of amplification or distillation will end up containing almost all of the difficulty).
    Recursive reward modeling fits in this framework, though my understanding is that it was arrived at mostly independently. I hope that work on iterated-amplification-in-general will be useful for analyzing recursive reward modeling, and conversely expect that experience with recursive reward modeling will be informative about the prospects for iterated-amplification-in-general.
    “It’s not obvious to me what isn’t.”
    Iterated amplification is intended to describe the kind of training procedure that is most natural using contemporary ML techniques. I think it’s quite likely that training strategies will have this form, even if people never read anything I write. (And indeed, AGZ and ExIt were published around the same time.)
    Introducing this concept was mostly intended as an analysis tool rather than a flag planting exercise (and indeed I haven’t done the kind of work that would be needed to plant a flag). From the prior position of “who knows how we might train aligned AI,” iterated amplification really does narrow down the space of possibilities a lot, and I think it has allowed my research to get much more concrete much faster than it otherwise would have.
    I think it was probably naive to hope to separate this kind of analysis from flag planting without being much more careful about it; I hope I haven’t made it too much more difficult for others to get due credit for working on ideas that happen to fit in this framework.
    “Debate is an instance of amplification.”
    Debate isn’t prima facie an instance of iterated amplification, i.e. it doesn’t fit in the framework I outlined at the start of this comment.
    Geoffrey and I both believe that debate is nearly equivalent to iterated amplification, in the sense that probably either they will both work or neither will. So the two approaches suggest very similar research questions. This makes us somewhat more optimistic that those research questions are good ones to work on.
    “Factored cognition is an instance of amplification.”
    “Factored cognition” refers to mechanisms for decomposing sophisticated reasoning into smaller tasks (quoting the link). Such mechanisms could be used for amplification, though there are other reasons you might study factored cognition.
    “[...] ‘amplification’ which is a broad framework for training ML systems with a human in the loop”
    The human isn’t really an essential part. I think it’s reasonably likely that we will use iterated amplification starting from a simple “core” for corrigible reasoning rather than starting from a human. (Though the resulting systems will presumably interact extensively with humans.)
- Dr_Manhattan 20 Nov 2018 21:20 UTC
  LW: 6 AF: 2
  They mention and link to iterated amplification in the Medium article.
  Scaling up
  In the long run, we would like to scale reward modeling to domains that are too complex for humans to evaluate directly. To do this, we need to boost the user’s ability to evaluate outcomes. We discuss how reward modeling can be applied recursively: we can use reward modeling to train agents to assist the user in the evaluation process itself. If evaluation is easier than behavior, this could allow us to bootstrap from simpler tasks to increasingly general and more complex tasks. This can be thought of as an instance of iterated amplification.
  - Wei Dai 20 Nov 2018 21:59 UTC
    LW: 17 AF: 10
    Yes, and they cite iterated amplification in their paper as well. But I’m trying to figure out whether they’re proposing anything new: the title here is “New safety research agenda: scalable agent alignment via reward modeling”, yet Paul’s post that I linked to already proposed applying reward modeling recursively. It seems that either I’m missing something or they didn’t read that post.