I think a reader might benefit from a note that learning from feedback is generally more powerful than learning from imitation, in the sense that it preserves strictly more of the information in a reward function. Without this note, the reader may not understand why the field of alignment has moved from imitation to feedback. I certainly lacked that understanding before I learned this fact.
This was apparently folk knowledge for a while, but it was shown formally by Skalse et al. in a paper at ICML 2023, which characterizes the full "reward learning lattice", i.e. which reward-learning data sources are more powerful than others.
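To make the point concrete, here is a minimal sketch (my own toy example, not taken from Skalse et al.): two reward functions over a one-step choice that share the same optimal action, so demonstrations from an optimal expert cannot distinguish them, while pairwise comparison feedback can. The action names and reward values are hypothetical.

```python
# Toy example: imitation vs. comparison feedback in a one-step "bandit".
actions = ["a", "b", "c"]

# Two candidate reward functions that agree on the best action ("a") ...
reward_1 = {"a": 1.0, "b": 0.5, "c": 0.0}
reward_2 = {"a": 1.0, "b": 0.0, "c": 0.5}

def optimal_action(reward):
    """What an optimal demonstrator would do under this reward."""
    return max(actions, key=reward.get)

# Imitation of optimal demonstrations only reveals the argmax: both rewards
# produce identical demonstrations, so imitation cannot tell them apart.
assert optimal_action(reward_1) == optimal_action(reward_2) == "a"

def preference(reward, x, y):
    """Comparison feedback: which of two actions a labeller prefers."""
    return x if reward[x] >= reward[y] else y

# Feedback over pairs exposes the ordering below the argmax, so it separates
# the two reward functions where imitation could not.
print(preference(reward_1, "b", "c"))  # prints "b" under reward_1
print(preference(reward_2, "b", "c"))  # prints "c" under reward_2
```

This is only meant to illustrate the direction of the claim; the paper's result is more general, covering which equivalence classes of rewards each data source can identify.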