I think preference preservation works in our favor, and an aligned model should have it, at least with respect to meta-values and core values. This rules out many failure modes: drifting from its values over time, dropping some values for the sake of greater consistency, or sacrificing some values to achieve better outcomes along others.