The provided link assumes that any preference can be expressed as a utility function over world-states. If you don't make that assumption (and you shouldn't, since human preferences can't be expressed that way), you cannot maximize a weighted average of candidate utility functions. Some actions are irreversible in preference terms. Take virtue ethics, for example: wiping out your memory doesn't restore your status as a virtuous person even if the world no longer contains any record of your unvirtuous acts, so you don't plan to do it.
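As a minimal sketch of the point (my own toy illustration, not anything from the linked article): two trajectories can end in the exact same world-state while a history-dependent preference, like the virtue-ethics one above, still ranks them differently, so no utility function over world-states alone can represent it.

```python
# Toy sketch: a history-dependent preference that no utility function
# over final world-states can represent. All names are illustrative only.

FINAL_STATE = "no memory or record of wrongdoing remains"

# Two trajectories that end in the exact same world-state.
trajectory_a = ["act virtuously", FINAL_STATE]
trajectory_b = ["act unvirtuously", "wipe memories and records", FINAL_STATE]

def virtue_preference(trajectory):
    """History-dependent: penalises ever having acted unvirtuously."""
    return 0.0 if "act unvirtuously" in trajectory else 1.0

def any_state_utility(final_state):
    """Stands in for *any* utility function over world-states: it only
    sees the final state, so identical states get identical values."""
    return len(final_state)  # the particular rule doesn't matter

# Identical final states force identical state-utility values...
assert trajectory_a[-1] == trajectory_b[-1]
assert any_state_utility(trajectory_a[-1]) == any_state_utility(trajectory_b[-1])

# ...but the history-dependent preference still tells them apart.
assert virtue_preference(trajectory_a) != virtue_preference(trajectory_b)
```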
When I asked here earlier why the article "Problem of Fully Updated Deference" relies on this incorrect assumption, the answer I got was that it's better to have some approximation than none, since it allows progress in exploring the alignment problem. But I see that it has become an unconditional cornerstone rather than a toy example for analysis.