The human ability to model other humans' preferences may be evidence that alignment is possible: we evolved to present and predict each other's goals (and our own). Our goals are therefore expressed in ways that another agent could reconstruct.
However, "X is not about X" could be true here. What humans think of as their "goals" or "rationality" may not be that at all, but merely signals. For example, being angry at someone, and being the one at whom someone is angry, is a very clear situation for both humans involved, but what does it actually mean to an outside non-human observer? Is it a temporary tantrum of a friend, or a precommitment to kill? Is it a joke, a piece of theatre, an expression of love, or an act of war?