I’m not sure it’s the same thing as alignment… it seems there are at least three concepts here, and Hjalmar is talking about the 2nd, which is importantly different from the 1st:
1. “classic notion of alignment”: The AI has the correct goal (represented internally, e.g. as a reward function)
2. “CIRL notion of alignment”: The AI has a pointer to the correct goal (but the goal is represented externally, e.g. in a human partner’s mind)
3. “corrigibility”: something else again, roughly that the AI accepts correction or shutdown from its principal, whether or not it currently has (or points to) the correct goal