This is a good post; it definitely shows how confused these concepts are. In a sense, both examples are failures of both inner and outer alignment:
Training the AI with reinforcement learning is a failure of outer alignment, because the reward signal alone does not carry enough information to fully specify the intended goal.
The model develops within the space of behaviours the under-specified goal allows, and ends up with behaviours misaligned with the goal we actually intended.
Also, the choice to train the AI on pull requests at all is, in a sense, itself an outer alignment failure.
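As a toy illustration of the under-specification point (hypothetical names, not from the post): if the reward only observes whether a pull request was merged, the "write genuinely useful code" part of the goal is never stated, so any behaviour that gets PRs merged is rewarded equally.

```python
# Toy sketch of an under-specified reward signal (hypothetical, for illustration only).
# The intended goal is "write genuinely useful code", but the reward only sees
# whether a PR was merged, so it cannot distinguish useful contributions from
# ones that merely get past review.

from dataclasses import dataclass

@dataclass
class PullRequest:
    merged: bool            # observable by the reward function
    genuinely_useful: bool  # part of the intended goal, but never observed

def proxy_reward(pr: PullRequest) -> float:
    """Reward the training process actually optimises."""
    return 1.0 if pr.merged else 0.0

def intended_reward(pr: PullRequest) -> float:
    """Reward we wish we could specify."""
    return 1.0 if pr.merged and pr.genuinely_useful else 0.0

# Two behaviours the proxy reward cannot tell apart:
helpful = PullRequest(merged=True, genuinely_useful=True)
gaming = PullRequest(merged=True, genuinely_useful=False)
assert proxy_reward(helpful) == proxy_reward(gaming)       # the outer alignment gap
assert intended_reward(helpful) != intended_reward(gaming)
```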