If you don’t know where you’re going, it’s not enough to merely avoid going somewhere that’s definitely not where you want to end up; you have to differentiate paths toward the destination from all other paths, or you fail.
I’m not exactly sure what you meant here, but I don’t think this claim holds for RLHF: labelers only need to judge which of two candidate responses is better, and those pairwise choices are then used to train the reward model. Pairwise (binary) feedback was chosen specifically because it’s usually too difficult for labelers to rank or score many options at once.
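To make the pairwise point concrete, here is a minimal sketch (not from the original comment) of the Bradley–Terry-style loss commonly used to train an RLHF reward model from such binary choices: the model is pushed to score the “chosen” response above the “rejected” one via `-log sigmoid(r_chosen - r_rejected)`. The tiny linear reward model and toy data are illustrative assumptions, not a real implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    """Maps a feature vector for a (prompt, response) pair to a scalar reward."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)  # shape: (batch,)

def pairwise_preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Negative log-likelihood that the labeler-preferred response beats the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy "features" standing in for encoded (prompt, response) pairs (hypothetical data).
dim = 16
model = TinyRewardModel(dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

chosen = torch.randn(32, dim) + 0.5    # responses labelers preferred
rejected = torch.randn(32, dim) - 0.5  # responses labelers rejected

for step in range(100):
    loss = pairwise_preference_loss(model(chosen), model(rejected))
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final loss: {loss.item():.3f}")  # decreases as the model learns to rank chosen > rejected
```

The point is that the reward model never needs an absolute “goodness” score from labelers; a stream of binary better/worse judgments is enough to fit a scalar reward that ranks responses.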
A similar idea is comparison sorting, where an algorithm only needs the ability to compare two elements at a time to sort an entire list. A short sketch follows.
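An illustrative sketch of that analogy (the comparator and data are hypothetical): the sort never sees absolute scores, it only asks “does a come before b?” for pairs, yet that binary signal is enough to recover a full ordering, much like pairwise preference labels are enough to train a reward model that ranks responses.

```python
from functools import cmp_to_key

def compare(a: int, b: int) -> int:
    """Binary judgment on a single pair: negative if a should come first."""
    return -1 if a < b else (1 if a > b else 0)

items = [42, 7, 19, 3, 25]
print(sorted(items, key=cmp_to_key(compare)))  # [3, 7, 19, 25, 42]
```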