In RLHF there are at least three different (stochastic) reward functions:
1. the learned value network,
2. the “human clicks 👍/👎” process, and
3. the “what if we asked a whole human research group and they had unlimited time and assistance to deliberate about this one answer” process.
I think the first two correspond to what that paper calls “proxy” and “gold”, but I am instead concerned with the ways in which 2 is a proxy for 3.
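To make the proxy chain concrete, here is a toy sketch (entirely hypothetical, not from any real RLHF setup): level 3 is treated as the "true" target, level 2 is a noisy, biased single-click sample of it, and level 1 is a crude model fit to those clicks. The function names and scoring rules are illustrative stand-ins only.

```python
import random

def deliberative_reward(answer: str) -> float:
    """Level 3: the idealized 'research group with unlimited time' judgment.
    Here just a stand-in scoring rule we pretend is what we really want."""
    return float(answer.count("careful"))

def click_reward(answer: str) -> float:
    """Level 2: one rushed human clicking 👍/👎.
    A stochastic, biased sample of level 3 -- e.g. swayed by answer length."""
    biased = deliberative_reward(answer) + 0.5 * (len(answer) > 40)  # length bias
    return float(random.random() < min(1.0, biased / 3.0))  # noisy 1/0 click

def learned_reward_model(answer: str) -> float:
    """Level 1: a model fit to level-2 clicks; it inherits the click process's
    biases and adds its own fitting error on top."""
    return 0.3 * deliberative_reward(answer) + 0.02 * len(answer)

if __name__ == "__main__":
    random.seed(0)
    answers = [
        "careful short reply",
        "a long reply that is not especially thorough " * 3,
    ]
    for a in answers:
        print(f"{a[:30]!r:35} gold={deliberative_reward(a):.1f} "
              f"click={click_reward(a):.0f} rm={learned_reward_model(a):.2f}")
```

Optimizing hard against level 1 (or even level 2) in this toy can favor the long-but-shallow answer over the careful one, which is the sense in which 2, and not just 1, is a lossy proxy for 3.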