This is a sequence curated by Paul Christiano on one current approach to alignment: Iterated Amplification.
Iterated Amplification
Preface to the sequence on iterated amplification
0. Problem statement
The first part of this sequence clarifies the problem that iterated amplification is trying to solve, which is both narrower and broader than you might expect.
The Steering Problem
Clarifying “AI Alignment”
An unaligned benchmark
Prosaic AI alignment
1. Basic intuition
The second part of the sequence outlines the basic intuitions that motivate iterated amplification. I think that these intuitions may be more important than the scheme itself, but they are considerably more informal.
Approval-directed agents
Approval-directed bootstrapping
Humans Consulting HCH
Corrigibility
2. The scheme
The core of the sequence is the third section. Benign model-free RL describes iterated amplification as a general outline into which we can substitute arbitrary algorithms for reward learning, amplification, and robustness. The first four posts all describe variants of this idea from different perspectives; if you find that one of those descriptions is clearest for you, I recommend focusing on that one and skimming the others. (A rough code sketch of the overall loop follows the list below.)
Iterated Distillation and Amplification
Benign model-free RL
Factored Cognition
Supervising strong learners by amplifying weak experts
AlphaGo Zero and capability amplification
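To make the shape of the scheme concrete, here is a minimal toy sketch of the amplification/distillation loop in Python. Everything in it (the Human and Agent classes, the arithmetic questions, the lookup-table "training") is a hypothetical illustration chosen for brevity, not code from any of the posts; a real instantiation would substitute actual reward learning, amplification, and robustness techniques for these placeholders.

```python
class Human:
    """Stands in for a human who can break a question into subquestions
    and combine the subanswers. Purely illustrative."""
    def decompose(self, question):
        # Toy decomposition: "sum 1 2 3" -> ["value 1", "value 2", "value 3"].
        return [f"value {x}" for x in question.split()[1:]]

    def combine(self, question, subanswers):
        return sum(subanswers)


class Agent:
    """A fast learned model, stood in for here by a lookup table."""
    def __init__(self):
        self.table = {}

    def answer(self, question):
        if question in self.table:
            return self.table[question]
        if question.startswith("value "):
            return int(question.split()[1])  # base case the agent handles directly
        return 0                             # unknown question: default answer

    def train(self, data):
        # "Distillation" here is just memorization; a real system would train
        # a model by imitation or reward learning.
        self.table.update(data)


def amplify(human, agent, question):
    """The human plus the current agent act as a slower but more capable system."""
    subanswers = [agent.answer(q) for q in human.decompose(question)]
    return human.combine(question, subanswers)


def iterated_amplification(human, agent, questions, iterations=3):
    for _ in range(iterations):
        # Amplification: answer questions with the human/agent composite.
        data = {q: amplify(human, agent, q) for q in questions}
        # Distillation: train the fast agent to reproduce those answers on its own.
        agent.train(data)
    return agent


if __name__ == "__main__":
    distilled = iterated_amplification(Human(), Agent(), ["sum 1 2 3", "sum 10 20"])
    print(distilled.answer("sum 1 2 3"))  # 6, now answered by the distilled agent alone
```

The point of the sketch is only the control flow: each round, the human-plus-agent composite produces behavior the fast agent cannot yet produce on its own, and distillation compresses that behavior back into the fast agent, which then strengthens the next round of amplification.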
3. What needs doing
The fourth part of the sequence describes some of the black boxes in iterated amplification and discusses what we would need to do to fill in those boxes. I think these are some of the most important open questions in AI alignment.
Directions and desiderata for AI alignment
The reward engineering problem
Capability amplification
Learning with catastrophes
4. Possible approaches
The fifth section of the sequence breaks down some of these problems further and describes some possible approaches.