Misaligned goal --> I should get high reward --> Behavior aligned with reward function
The shortest description of this thought doesn’t include “I should get high reward” because that’s already implied by having a misaligned goal and planning with it.
In contrast, having only the goal “I should get high reward” may add description length, as Johannes said. If so, the misaligned goal could well be as simple as, or simpler than, the high reward goal.
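One rough way to write the comparison down, as a sketch rather than a derivation: the symbols here are my own assumptions, not anything from the thread. Let $K(\cdot)$ be a Kolmogorov-style description length, $G$ the misaligned goal, $R$ the goal “I should get high reward”, and $P$ a general planning procedure. Since $R$ is derived by planning from $G$, it costs at most a constant on top of $G$ and $P$.

```latex
\begin{align*}
  K(R \mid G, P) &= O(1)
    && \text{(the reward subgoal falls out of planning with } G\text{)} \\
  K(\text{chain } G \to R \to \text{behavior}) &\le K(G) + K(P) + O(1) \\
  K(\text{chain } R \to \text{behavior}) &\le K(R) + K(P) + O(1)
\end{align*}
```

Under these assumptions, the two chains differ mainly in $K(G)$ versus $K(R)$, so the misaligned-goal chain is no more complex whenever $K(G) \le K(R)$.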