Bit surprised that you can think of no researchers to associate with Corrigibility. MIRI have written concrete work about it and so has Christiano. It is a major theme in Bostrom’s Superintelligence, and it also appears under the phrasing ‘problem of control’ in Russell’s Human Compatible.
In terms of the history of ideas of the field, I think that corrigibility is a key motivating concept for newcomers to be aware of. See this writeup on corrigibility, which I wrote in part for newcomers, for links to broader work on corrigibility.
I’ve only seen it come up as a term to reason about or aim for, rather than as a fully-fledged plan for how to produce corrigible systems.
My current reading of the field is that Christiano believes corrigibility will appear as an emergent property of building an aligned AGI according to his agenda. MIRI, on the other hand (or at least 2021 Yudkowsky), have abandoned the MIRI 2015 plans/agenda to produce corrigibility, and now despair about anybody else ever producing corrigibility either. The CIRL method discussed by Russell produces a type of corrigibility, but as Russell and Hadfield-Menell point out, this type decays as the agent learns more, so it is not a full solution.
I have written a few papers which have the most fully fledged plans that I am aware of, when it comes to producing (a pretty useful and stable version of) AGI corrigibility. This sequence is probably the most accessible introduction to these papers.
Thanks, yes that new phrasing is better.