This is a good post; it definitely shows how confused these concepts are. In a sense, both examples are failures of both inner and outer alignment:
Training the AI with reinforcement learning is a failure of outer alignment, because the reward signal alone does not carry enough information to fully specify the intended goal.
The model develops within the space of behaviours the under-specified goal allows, and ends up with behaviours misaligned with the goal we actually intended.
Also, the choice to train the AI on pull requests at all is, in a sense, itself an outer alignment failure.
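As a toy illustration of the under-specification point (hypothetical names, not from the post): if the reward only observes whether a pull request was merged, the "write genuinely useful code" part of the goal is never stated, so any behaviour that gets PRs merged is rewarded equally.

```python
# Toy sketch of an under-specified reward signal (hypothetical, for illustration only).
# The intended goal is "write genuinely useful code", but the reward only sees
# whether a PR was merged, so it cannot distinguish useful contributions from
# ones that merely get past review.

from dataclasses import dataclass

@dataclass
class PullRequest:
    merged: bool            # observable by the reward function
    genuinely_useful: bool  # part of the intended goal, but never observed

def proxy_reward(pr: PullRequest) -> float:
    """Reward the training process actually optimises."""
    return 1.0 if pr.merged else 0.0

def intended_reward(pr: PullRequest) -> float:
    """Reward we wish we could specify."""
    return 1.0 if pr.merged and pr.genuinely_useful else 0.0

# Two behaviours the proxy reward cannot tell apart:
helpful = PullRequest(merged=True, genuinely_useful=True)
gaming = PullRequest(merged=True, genuinely_useful=False)
assert proxy_reward(helpful) == proxy_reward(gaming)       # the outer alignment gap
assert intended_reward(helpful) != intended_reward(gaming)
```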