I think self-critique runs into the issues I describe in the post, though without insider information I can’t be certain. Naively, I’d expect existing distortions to get amplified rather than corrected by self-critique.
For human rating/RL, sample efficiency does seem possible in principle (human learning is an existence proof), but as far as I know we don’t yet know how to achieve that kind of sample efficiency in practice. And real-time human feedback is even scarcer than the human text that’s already out there. So I still expect that route to take longer than, say, self-play.
I agree that if outcome-based RL came to swamp the initial training datasets, the “playing human roles” section would be weaker, but is that the case now? My understanding (which could easily be wrong) is that RLHF is a comparatively small post-training layer that changes models only moderately, and is nowhere near the bulk of their training.