If the human doesn’t know what they would want, it doesn’t seem fair to blame the problem on alignment failure. In such a case, the problem would be a person’s lack of clarity.
Hmm, I see what you mean. However, that person’s lack of clarity would in fact also be called a “bad prediction”, which is exactly what I’m trying to point out in the post! These bad predictions can happen due to a number of different factors (missing relevant variables, misspecified initial conditions...). I believe the only reason we don’t call it “misaligned behaviour” is that we’re assuming people do not (usually) act according to an explicitly stated reward function!
What do you think?
Humans are notoriously good rationalizers and may downplay their own bad decisions. Making a fair comparison between “what the human would have done” and “what the AI agent would have done” may be quite tricky. (See the Fundamental Attribution Error, a.k.a. correspondence bias.)
Thanks for this pointer!