paulfchristiano comments on New safety research agenda: scalable agent alignment via reward modeling

paulfchristiano 2 Jan 2019 6:22 UTC
LW: 44 AF: 14
Iterated amplification is a very general framework, describing algorithms with two pieces:
- An amplification procedure that increases an agent’s capability. (The main candidates involve decomposing a task into pieces and invoking the agent to solve each piece separately, but there are lots of ways to do that.)
- A distillation procedure that uses a strong expert to train an efficient agent. (I usually consider semi-supervised RL, as in our paper.)
Given these two pieces, we plug them into each other: the output of distillation becomes the input to amplification, and the output of amplification becomes the input to distillation. You kick off the process with something aligned, or else design the amplification step so that it works from some arbitrary initialization.
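To make the shape of that loop concrete, here is a toy sketch. Everything in it is an illustrative assumption rather than anything from the paper: the task (summing a tuple of integers), the names `base_agent`, `amplify`, `distill`, and `iterate`, and above all the use of plain memorization as a stand-in for distillation, where a real system would train a model (e.g. with semi-supervised RL).

```python
from itertools import product
from typing import Callable, Iterable, Optional, Tuple

Question = Tuple[int, ...]
Agent = Callable[[Question], Optional[int]]  # None means "I can't answer this"


def base_agent(q: Question) -> Optional[int]:
    """Hypothetical starting agent: only handles questions of length <= 1."""
    if len(q) == 0:
        return 0
    if len(q) == 1:
        return q[0]
    return None


def amplify(agent: Agent) -> Agent:
    """Amplification: decompose the task and invoke the agent on each piece."""
    def amplified(q: Question) -> Optional[int]:
        direct = agent(q)
        if direct is not None:
            return direct
        mid = len(q) // 2
        left, right = agent(q[:mid]), agent(q[mid:])
        if left is None or right is None:
            return None  # a piece was still too hard for the current agent
        return left + right
    return amplified


def distill(expert: Agent, questions: Iterable[Question]) -> Agent:
    """Distillation stand-in: memorize the expert's answers in a table.

    A real distillation step would train a fast model to imitate the expert;
    a lookup table is used here only to keep the sketch self-contained.
    """
    table = {}
    for q in questions:
        answer = expert(q)
        if answer is not None:
            table[q] = answer
    return table.get  # returns None for questions outside the table


def iterate(rounds: int, questions: Iterable[Question]) -> Agent:
    """Plug the two pieces into each other for a fixed number of rounds."""
    questions = list(questions)
    agent: Agent = base_agent
    for _ in range(rounds):
        agent = distill(amplify(agent), questions)
    return agent


# Training distribution for the toy: all tuples over {0, 1, 2} of length 0..4.
QUESTIONS = [q for n in range(5) for q in product(range(3), repeat=n)]
```

In this toy, each round doubles the length of question the distilled agent can handle: `iterate(1, QUESTIONS)` answers questions up to length 2, and `iterate(2, QUESTIONS)` answers questions up to length 4. The sketch says nothing about alignment; it only shows how the output of each step feeds the other.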
The hope is that the result is aligned because:
- Amplification preserves alignment (or benignness, corrigibility, or some similar invariant).
- Distillation preserves alignment, as long as the expert is “smart enough” (relative to the agent they are training).
- Amplification produces “smart enough” agents.
My research is organized around this structure—thinking about how to fill in the various pieces, about how to analyze training procedures that have this shape, about what the most likely difficulties are. For me, the main appeal of this structure is that it breaks the full problem of training an aligned AI into two subproblems which are both superficially easier (though my expectation is that at least one of amplification or distillation will end up containing almost all of the difficulty).
Recursive reward modeling fits in this framework, though my understanding is that it was arrived at mostly independently. I hope that work on iterated-amplification-in-general will be useful for analyzing recursive reward modeling, and conversely expect that experience with recursive reward modeling will be informative about the prospects for iterated-amplification-in-general.
> It’s not obvious to me what isn’t.
Iterated amplification is intended to describe the kind of training procedure that is most natural using contemporary ML techniques. I think it’s quite likely that training strategies will have this form, even if people never read anything I write. (And indeed, AlphaGo Zero (AGZ) and Expert Iteration (ExIt) were published around the same time.)
Introducing this concept was mostly intended as an analysis tool rather than a flag planting exercise (and indeed I haven’t done the kind of work that would be needed to plant a flag). From the prior position of “who knows how we might train aligned AI,” iterated amplification really does narrow down the space of possibilities a lot, and I think it has allowed my research to get much more concrete much faster than it otherwise would have.
I think it was probably naive to hope to separate this kind of analysis from flag planting without being much more careful about it; I hope I haven’t made it too much more difficult for others to get due credit for working on ideas that happen to fit in this framework.
> Debate is an instance of amplification.
Debate isn’t prima facie an instance of iterated amplification, i.e. it doesn’t fit in the framework I outlined at the start of this comment.
Geoffrey and I both believe that debate is nearly equivalent to iterated amplification, in the sense that probably either they will both work or neither will. So the two approaches suggest very similar research questions. This makes us somewhat more optimistic that those research questions are good ones to work on.
> Factored cognition is an instance of amplification
“Factored cognition” refers to mechanisms for decomposing sophisticated reasoning into smaller tasks (quoting the link). Such mechanisms could be used for amplification, though there are other reasons you might study factored cognition.
> “amplification” which is a broad framework for training ML systems with a human in the loop
The human isn’t really an essential part. I think it’s reasonably likely that we will use iterated amplification starting from a simple “core” for corrigible reasoning rather than starting from a human. (Though the resulting systems will presumably interact extensively with humans.)