Thanks for this! I found it interesting and useful.
I don’t have much specific feedback, partly because I listened to this via Nonlinear Library while doing other things rather than reading it, but I’ll share some thoughts anyway since you indicated being very keen for feedback.
In general, I think this sort of distillation work is important and under-supplied.
This seems like a good example of what such distillation work should look like: broken into separate posts that can be read independently, starting with an overall overview; each post divided into clear, logical sections and subsections; use of bold; clarity about terms; and meta notes added where relevant.
Maybe it would’ve been useful to at least name & link to the threat models, agendas to build safe AGI, and robustly good approaches that you don’t discuss in any further detail, rather than not mentioning them at all?
That could make it easier for people to dive deeper if they want, help avoid giving the impression that the things you list are the only things in those categories, and help people understand what you mean by the overall categories by showing more examples within them.
This assumes you think there are other discernible, nameable constituents of those categories that you didn’t name; I guess it’s possible that you don’t think that.
I’ll put the Anki cards I made in a reply to this comment, on the off chance that they’re of interest to you as oblique feedback, or to other people who’d like to use the same cards themselves.
Thanks a lot for the feedback, and the Anki cards! Appreciated. I definitely find that level of feedback motivating :)
These categories were formed by a vague combination of “what things do I hear people talking about/researching” and “what do I understand well enough that I can write intelligent summaries of it”—this is heavily constrained by what I have and have not read! (I am much less good than Rohin Shah at reading everything in Alignment :’( )
E.g., Steve Byrnes does a bunch of research that seems potentially cool, but I haven’t read much of it and don’t have a good sense of what it’s actually about, so I didn’t talk about it. This isn’t expressing an opinion that, e.g., his research is bad.
I’ve updated towards including a section at the end of each post/section with “stuff that seems maybe relevant that I haven’t read enough to feel comfortable summarising”.
My Anki cards
Nanda broadly sees there as being 5 main types of approach to alignment research:
Addressing threat models: We keep a specific threat model in mind for how AGI causes an existential catastrophe, and focus our work on things that we expect will help address the threat model.
Agendas to build safe AGI: Let’s make specific plans for how to actually build safe AGI, and then try to test, implement, and understand the limitations of these plans. With an emphasis on understanding how to build AGI safely, rather than trying to do it as fast as possible.
Robustly good approaches: In the long-run AGI will clearly be important, but we’re highly uncertain about how we’ll get there and what, exactly, could go wrong. So let’s do work that seems good in many possible scenarios, and doesn’t rely on having a specific story in mind. Interpretability work is a good example of this.
De-confusion: Reasoning about how to align AGI involves reasoning about complex concepts, such as intelligence, alignment and values, and we’re pretty confused about what these even mean. This means any work we do now is plausibly not helpful and definitely not reliable. As such, our priority should be to do some conceptual work on how to think about these concepts and what we’re aiming for, and to try to become less confused. I consider the process of coming up with each of the research motivations outlined in this post to be an example of good de-confusion work.
Field-building: One of the biggest factors in how much Alignment work gets done is how many researchers are working on it, so a major priority is building the field. This is especially valuable if you think we’re confused about what work needs to be done now, but will eventually have a clearer idea once we’re within a few years of AGI. When this happens, we want a large community of capable, influential and thoughtful people doing Alignment work.
Nanda focuses on three threat models that he thinks are most prominent and are addressed by most current research:
Power-Seeking AI
You get what you measure [The case given by Paul Christiano in What Failure Looks Like (Part 1)]
AI Influenced Coordination Failures [The case put forward by Andrew Critch, eg in What multipolar failure looks like. Many players get AGI around the same time. They now need to coordinate and cooperate with each other and the AGIs, but coordination is an extremely hard problem. We currently deal with this with a range of existing international norms and institutions, but a world with AGI will be sufficiently different that many of these will no longer apply, and we will leave our current stable equilibrium. This is such a different and complex world that things go wrong, and humans are caught in the cross-fire.]
Nanda considers three agendas to build safe AGI to be most prominent:
Iterated Distillation and Amplification (IDA)
AI Safety via Debate
Solving Assistance Games [This is Stuart Russell’s agenda, which argues for a perspective shift in AI towards a more human-centric approach.]
Nanda highlights 3 “robustly good approaches” (in the context of AGI risk):
Interpretability
Robustness
Forecasting
[I doubt he sees these as exhaustive (though that’s possible), and I’m not sure whether he sees them as the most important, prominent, or central examples.]