Reformulating a preference so that references to self are replaced with the specific people it already refers to doesn’t change its meaning, so semantically such rewriting doesn’t affect alignment. It only matters for copying, which doesn’t respect the semantics of preferences. Other procedures that meddle with minds can disrupt the semantics of preference in ways that can’t be worked around.
(All of this only makes sense for toy agent models that furthermore have a clear notion of references to self, not for literal humans. Humans don’t have preferences in this sense; human preference is a theoretical construct that needs something like CEV, the outcome of a properly set up long reflection, to access.)
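As a concrete illustration of the toy-model setting, here is a minimal sketch in Python. Everything in it is hypothetical and invented for illustration (the preference language, the names, the `resolve` and `derefer` helpers); it just shows why the self-to-name rewrite is a semantic no-op for the original holder, while copying fails to respect the indexical reference.

```python
from dataclasses import dataclass

# Hypothetical toy preference language: an agent prefers that some
# person be happy. "self" is an indexical reference, resolved against
# whoever holds (evaluates) the preference.

@dataclass(frozen=True)
class Ref:
    name: str  # either "self" or a specific person's name

@dataclass(frozen=True)
class PrefersHappy:
    subject: Ref

def resolve(pref: PrefersHappy, holder: str) -> str:
    """Who the preference is about, from the holder's point of view."""
    return holder if pref.subject.name == "self" else pref.subject.name

def derefer(pref: PrefersHappy, holder: str) -> PrefersHappy:
    """Rewrite 'self' into the specific person it already refers to.
    For the original holder this is semantically a no-op."""
    return PrefersHappy(Ref(resolve(pref, holder)))

p = PrefersHappy(Ref("self"))
q = derefer(p, holder="alice")

# For Alice herself, the rewrite changes nothing:
assert resolve(p, "alice") == resolve(q, "alice") == "alice"

# Copying doesn't respect the semantics: a copy named Bob who inherits
# the un-rewritten preference ends up with a preference about Bob,
# not about Alice.
assert resolve(p, "bob") == "bob"    # indexical drifts with the copy
assert resolve(q, "bob") == "alice"  # rewritten form is copy-stable
```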