I’m not sure it’s the same thing as alignment… it seems there are at least three concepts here, and Hjalmar is talking about the 2nd, which is importantly different from the 1st:
1. “classic notion of alignment”: The AI has the correct goal (represented internally, e.g. as a reward function)
2. “CIRL notion of alignment”: The AI has a pointer to the correct goal (but the goal is represented externally, e.g. in a human partner’s mind)
3. “corrigibility”: something else again, roughly that the AI accepts correction or shutdown from its principal, whether or not it currently has (or points to) the correct goal