I think the Corrigibility agenda, framed as “do what I mean, such that I will probably approve of the consequences, not just what I literally say, such that our interaction will likely harm my goals”, is more doable than some have made it out to be. I still think there are enough subtle gotchas that it makes sense to treat it as an area for careful study rather than “solved by default, no need to worry”.