Thanks for the feedback! That makes sense, I’ve updated the intro paragraph to that section to:
There is a range of agendas proposed for how we might build safe AGI, though note that each agenda is far from a complete and concrete plan. I think of them more as a series of confusions to explore and assumptions to test, with the eventual goal of making a concrete plan. I focus on three agendas here; these are just the three I know the most about, have seen the most work on, and consider, in my subjective judgement, the most worthwhile for newcomers to the field to learn about. This is not intended to be comprehensive; see eg Evan Hubinger’s Overview of 11 proposals for building safe advanced AI for more.
Does that seem better?
For what it’s worth, my main bar was a combination of ‘do I understand this agenda well enough to write a summary’ and ‘do I associate at least one researcher and some concrete work with this agenda’. I wouldn’t think of corrigibility as passing the second bar, since I’ve only seen it come up as a term to reason about or aim for, rather than as a fully-fledged plan for how to produce corrigible systems. It’s very possible I’ve missed some important work though, and I’d love to hear pushback on this.
Thanks, yes that new phrasing is better.

I’m a bit surprised that you can think of no researchers to associate with corrigibility. MIRI have written concrete work about it, and so has Christiano. It is a major theme in Bostrom’s Superintelligence, and it also appears under the phrasing ‘problem of control’ in Russell’s Human Compatible.
In terms of the history of ideas in the field, I think that corrigibility is a key motivating concept for newcomers to be aware of. See this writeup on corrigibility, which I wrote in part for newcomers, for links to broader work on the topic.
I’ve only seen it come up as a term to reason about or aim for, rather than as a fully-fledged plan for how to produce corrigible systems.
My current reading of the field is that Christiano believes corrigibility will appear as an emergent property of building an aligned AGI according to his agenda, while MIRI (or at least 2021 Yudkowsky) have abandoned the 2015 MIRI plans/agenda for producing corrigibility, and now despair of anybody else ever producing it either. The CIRL method discussed by Russell produces a type of corrigibility, but as Russell and Hadfield-Menell point out, this type decays as the agent learns more, so it is not a full solution.
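To make that decay concrete, here is a minimal numerical sketch in the spirit of Hadfield-Menell et al.’s off-switch-game analysis (the Gaussian belief and the specific parameter values below are illustrative assumptions, not anything from their paper or this discussion): the agent’s incentive to defer to a human overseer comes from its uncertainty about utility, and shrinks towards zero as that uncertainty shrinks.

```python
# Illustrative sketch (assumed setup): off-switch-game style incentive to defer.
# The agent believes the utility U of its proposed action is N(mu, sigma^2).
# - Acting on its own is worth max(E[U], 0) = max(mu, 0) (it can also switch itself off).
# - Deferring to a rational human who vetoes bad actions is worth E[max(U, 0)].
# The gap E[max(U, 0)] - max(mu, 0) is the value of deferring; it vanishes as sigma -> 0,
# which is the sense in which this kind of corrigibility decays as the agent learns.

from scipy.stats import norm

def value_of_deferring(mu: float, sigma: float) -> float:
    """E[max(U, 0)] - max(mu, 0) for U ~ N(mu, sigma^2)."""
    expected_if_deferring = mu * norm.cdf(mu / sigma) + sigma * norm.pdf(mu / sigma)
    expected_if_acting = max(mu, 0.0)
    return expected_if_deferring - expected_if_acting

mu = 0.5  # the agent's current estimate of the action's utility (assumed value)
for sigma in [2.0, 1.0, 0.5, 0.1, 0.01]:
    print(f"sigma={sigma:5.2f}  incentive to defer = {value_of_deferring(mu, sigma):.4f}")
```

Running this, the deference incentive is large when sigma is large and goes to zero as the belief concentrates, matching the point that the agent’s incentive to remain correctable disappears once it becomes confident in its own estimate of what is best.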
I have written a few papers containing what are, as far as I am aware, the most fully-fledged plans for producing (a pretty useful and stable version of) AGI corrigibility. This sequence is probably the most accessible introduction to these papers.