Thanks for the post! You mention that it's unlikely PHF is as sample-efficient as RLHF; do you have plans to explore that direction? Most attributes we'd like to condition on are not trivially inferred, so labels are scarce or expensive to acquire. I'm interested in how alignment scales with the amount of labeled data. Perhaps this work could synergize well with TracIn or Influence Functions to identify examples that help or hurt performance on a small test set (a rough sketch of that selection step is below).
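To make the TracIn suggestion concrete, here's a minimal sketch of how one might score training examples against a small test set. It assumes a PyTorch model and loss; the function name `tracin_scores` and the `train_batch`/`test_batch` structure are hypothetical, and for brevity it uses a single checkpoint rather than the sum over checkpoints from the TracIn paper:

```python
import torch

def tracin_scores(model, loss_fn, train_batch, test_batch, lr=1e-3):
    """Approximate TracIn influence of each training example on a small
    test set at a single checkpoint. The full method sums this quantity
    over saved checkpoints, weighted by each checkpoint's learning rate."""
    params = [p for p in model.parameters() if p.requires_grad]

    # Gradient of the summed test loss at this checkpoint.
    test_loss = loss_fn(model(test_batch["x"]), test_batch["y"])
    test_grads = torch.autograd.grad(test_loss, params)

    scores = []
    for x, y in zip(train_batch["x"], train_batch["y"]):
        train_loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        train_grads = torch.autograd.grad(train_loss, params)
        # Dot product of train and test gradients:
        # positive => the example helps the test set, negative => it hurts.
        dot = sum((g1 * g2).sum() for g1, g2 in zip(train_grads, test_grads))
        scores.append(lr * dot.item())
    return scores  # rank examples by score; prune the most negative ones
```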
In practice I think using a trained reward model (as in RLHF), rather than fixed labels, is the way forward. The cost of acquiring the reward model is then the same as in RLHF; the difference is primarily that PHF typically needs many more calls to the reward model than RLHF does.
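For intuition on where those extra calls come from, here's a minimal sketch of annotating a pretraining corpus with a reward model for conditional training. It assumes an HF-style sequence-classification reward model; the control-token strings, the `threshold`, and the per-document (rather than per-sentence) granularity are all illustrative assumptions, not the paper's exact setup:

```python
import torch

GOOD, BAD = "<|good|>", "<|bad|>"  # hypothetical control tokens

@torch.no_grad()
def annotate_corpus(reward_model, tokenizer, documents, threshold=0.0):
    """Prepend a control token to each pretraining document based on a
    learned reward model's score. Note the reward model is called once
    per document, i.e. over the entire pretraining corpus -- far more
    calls than scoring rollouts during RLHF fine-tuning."""
    annotated = []
    for doc in documents:
        inputs = tokenizer(doc, return_tensors="pt", truncation=True)
        # Assumes a scalar-output sequence-classification head.
        score = reward_model(**inputs).logits.squeeze().item()
        tag = GOOD if score >= threshold else BAD
        annotated.append(tag + doc)
    return annotated
```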