And an unintended optimum of the reward function rewards trajectories that are obviously bad
Whether it’s an optimum or not doesn’t seem relevant. What’s relevant is the scalar reward values output on the realized datapoints.

I emphasize this because the “unintended optimum” phrasing seems to reliably trigger cached thoughts along the lines of “reward functions need to be robust graders.” (I also don’t like talking about an “optimum” of values, because I don’t think that’s how values work in detail, as opposed to in gloss, and “optimum” probably evokes similar thoughts along the lines of “values must be robust against adversaries.”)
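To make the point concrete, here is a minimal sketch (my own toy illustration, not from the original text) of a REINFORCE-style update for a Bernoulli policy. The reward function and the hypothetical `reward` helper are made up for illustration; the thing to notice is that the update only ever consumes scalar reward values on trajectories the policy actually sampled, and the reward function’s global optimum never enters the computation anywhere.

```python
import random

def reward(trajectory):
    # Hypothetical reward for illustration: count of 1s in a bit-string.
    return sum(trajectory)

def sample_trajectory(p, length=5):
    # Policy: emit 1 with probability p at each step.
    return [1 if random.random() < p else 0 for _ in range(length)]

def reinforce_step(p, lr=0.01, batch=32):
    # Score-function gradient for a Bernoulli(p) policy:
    # d/dp log[p^a (1-p)^(1-a)] = (a - p) / (p * (1 - p)),
    # weighted by the realized scalar reward of each sampled trajectory.
    grad = 0.0
    for _ in range(batch):
        traj = sample_trajectory(p)
        r = reward(traj)  # only realized datapoints ever get scored
        for a in traj:
            grad += r * (a - p) / (p * (1 - p))
    p += lr * grad / batch
    return min(max(p, 0.01), 0.99)  # clamp away from the boundaries

random.seed(0)
p = 0.5
for _ in range(200):
    p = reinforce_step(p)
# p drifts toward emitting 1s because sampled trajectories that happened
# to contain more 1s received higher scalar rewards -- not because any
# part of the optimizer inspected the reward function's optimum.
```

Nothing in the update rule depends on what the reward function does off the sampled distribution; chiseling behavior into the policy happens entirely through the rewards realized on the batch.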