Any treatment of Goodhart’s Law in the context of reward modeling should mention Scaling Laws for Reward Model Overoptimization (Gao et al., 2022), which empirically measures how the true reward first rises and then falls as you optimize against a proxy reward model, and possibly also something like this recent paper, which uses early stopping to limit overoptimization and extends it to the case of multiple reward models.