Nanda broadly sees five main types of approach to alignment research:
Addressing threat models: We keep a specific threat model in mind for how AGI causes an existential catastrophe, and focus our work on things that we expect will help address the threat model.
Agendas to build safe AGI: Let’s make specific plans for how to actually build safe AGI, and then try to test, implement, and understand the limitations of those plans. The emphasis is on understanding how to build AGI safely, rather than on building it as fast as possible.
Robustly good approaches: In the long-run AGI will clearly be important, but we’re highly uncertain about how we’ll get there and what, exactly, could go wrong. So let’s do work that seems good in many possible scenarios, and doesn’t rely on having a specific story in mind. Interpretability work is a good example of this.
De-confusion: Reasoning about how to align AGI involves reasoning about complex concepts, such as intelligence, alignment, and values, and we’re pretty confused about what these even mean. This means any work we do now is plausibly not helpful and definitely not reliable. As such, our priority should be to do conceptual work on how to think about these concepts and what we’re aiming for, and try to become less confused. I consider the process of coming up with each of the research motivations outlined in this post to be an example of good de-confusion work.
Field-building: One of the biggest factors in how much Alignment work gets done is how many researchers are working on it, so building the field is a major priority. This is especially valuable if you think we’re confused about what work needs to be done now but will eventually have a clearer idea once we’re within a few years of AGI. When that happens, we want a large community of capable, influential, and thoughtful people doing Alignment work.
Nanda focuses on three threat models that he thinks are most prominent and are addressed by most current research:
Power-Seeking AI
You get what you measure [The case given by Paul Christiano in What Failure Looks Like (Part 1)]
AI-Influenced Coordination Failures [The case put forward by Andrew Critch, e.g. in What multipolar failure looks like. Many players get AGI around the same time. They now need to coordinate and cooperate with each other and with the AGIs, but coordination is an extremely hard problem. We currently manage this with a range of existing international norms and institutions, but a world with AGI will be sufficiently different that many of these will no longer apply, and we will leave our current stable equilibrium. This is such a different and complex world that things go wrong, and humans are caught in the crossfire.]
Nanda considers three agendas to build safe AGI to be most prominent:
Iterated Distillation and Amplification (IDA)
AI Safety via Debate
Solving Assistance Games [This is Stuart Russell’s agenda, which argues for a perspective shift in AI towards a more human-centric approach.]
Nanda highlights three “robustly good approaches” (in the context of AGI risk):
Interpretability
Robustness
Forecasting
[I doubt he sees these as exhaustive—though that’s possible—and I’m not sure if he sees them as the most important/prominent/most central examples.]
My Anki cards
Nanda broadly sees five main types of approach to alignment research.
Nanda focuses on three threat models that he thinks are most prominent and are addressed by most current research:
Nanda considers three agendas to build safe AGI to be most prominent:
Nanda highlights three “robustly good approaches” (in the context of AGI risk):