Richard_Ngo comments on Thoughts on gradient hacking

Richard_Ngo 6 Sep 2021 14:41 UTC
LW: 4 AF: 3
I discuss the possibility of it going in some other direction when I say “The two most salient options to me”. But the bit of Evan’s post that this contradicts is:
Now, if the model gets to the point where it’s actually just failing because of this, then gradient descent will probably just remove that check—but the trick is never to actually get there.