In RLHF there are at least three different (stochastic) reward functions:
1. the learned value network,
2. the “human clicks 👍/👎” process, and
3. the “what if we asked a whole human research group and they had unlimited time and assistance to deliberate about this one answer” process.
I think the first two correspond to what that paper calls “proxy” and “gold”, but I am instead concerned with the ways in which 2 is a proxy for 3.
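To make the proxy chain concrete, here is a toy sketch (entirely hypothetical, not from any real RLHF setup): level 3 is treated as the "true" target, level 2 is a noisy, biased single-click sample of it, and level 1 is a crude model fit to those clicks. The function names and scoring rules are illustrative stand-ins only.

```python
import random

def deliberative_reward(answer: str) -> float:
    """Level 3: the idealized 'research group with unlimited time' judgment.
    Here just a stand-in scoring rule we pretend is what we really want."""
    return float(answer.count("careful"))

def click_reward(answer: str) -> float:
    """Level 2: one rushed human clicking 👍/👎.
    A stochastic, biased sample of level 3 -- e.g. swayed by answer length."""
    biased = deliberative_reward(answer) + 0.5 * (len(answer) > 40)  # length bias
    return float(random.random() < min(1.0, biased / 3.0))  # noisy 1/0 click

def learned_reward_model(answer: str) -> float:
    """Level 1: a model fit to level-2 clicks; it inherits the click process's
    biases and adds its own fitting error on top."""
    return 0.3 * deliberative_reward(answer) + 0.02 * len(answer)

if __name__ == "__main__":
    random.seed(0)
    answers = [
        "careful short reply",
        "a long reply that is not especially thorough " * 3,
    ]
    for a in answers:
        print(f"{a[:30]!r:35} gold={deliberative_reward(a):.1f} "
              f"click={click_reward(a):.0f} rm={learned_reward_model(a):.2f}")
```

Optimizing hard against level 1 (or even level 2) in this toy can favor the long-but-shallow answer over the careful one, which is the sense in which 2, and not just 1, is a lossy proxy for 3.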