So more concretely, this is work towards some sort of RLHF training regime that “provably” avoids Goodharting. The main issue is that a lot of the numbers we’re using are quite hard to approximate.