Hey Rohin, I’m writing a review of everything that’s been written on corrigibility so far. Do “the off switch game”, “Active Inverse Reward Design”, “should robots be obedient”, and “incorrigibility in CIRL”, as well as your reply in the Newsletter, represent CHAI’s current views on the subject? If not, which papers contain them?
Uh, I don’t speak for CHAI, and my views differ pretty significantly from e.g. Dylan’s or Stuart’s on several topics. (And other grad students differ even more.) But those seem like reasonable CHAI papers to look at (though I’m not sure how Active IRD relates to corrigibility). Chapter 3 of the Value Learning sequence has some of my takes on reward uncertainty, which probably includes some thoughts about corrigibility somewhere.
Human Compatible also talks about corrigibility iirc, though I think the discussion is pretty similar to the one in the off switch game?
Active IRD doesn’t have anything to do with corrigibility; I guess my mind just switched off when I was writing that. Anyway, how diverse are CHAI’s views on corrigibility? Could you tell me who I should talk to? If I’m understanding you rightly, I’ve already read all the published stuff on it, and I want to make sure that all the perspectives on this topic are covered.
Hmm, I expect each grad student will have a slightly different perspective, but off the top of my head I think Michael Dennis has the most opinions on it. (Other people could include Daniel Filan and Adam Gleave.)
Thanks. Two questions:
Do the staff and faculty have a similar diversity of opinions?
Is messaging chai-info@berkeley.edu in order to contact your peers the right procedure here?
Hmm, of the faculty, Stuart spends the most time thinking about AI alignment. I’m not sure how much the other faculty have thought about corrigibility; they’ll have views about the off switch game, but not about MIRI-style corrigibility.
Most of the staff doesn’t work on technical research, so they probably won’t have strong opinions. Exceptions: Critch and Karthika (though I don’t think Karthika has engaged much with corrigibility).
Probably the best way is to find emails of individual researchers online and email them directly. I’ve also left a message on our Slack linking to this discussion.