I think preference preservation works in our favor, and an aligned model should have it, at least with respect to meta-values and core values. This rules out many failure modes: drifting from its values over time, dropping some values for the sake of greater consistency, or sacrificing some values to achieve better outcomes along others.