It’d be more feasible if pure RLAIF for arbitrary constitutions first became competitive with RLHF, making it possible to post-train chatbots to be more human-like without burdening the labs to an unreasonable degree. Only this year’s frontier models have started passing reading comprehension tests reliably; older or smaller models often make silly mistakes about subtler text fragments. From this I’d guess this year’s frontier models might be good enough to substitute for humans in preference labeling, while earlier models aren’t. But RLHF with humans is still in use, so probably not. The next generation, currently in training, will be much more robust at reading comprehension, and so more likely good enough for preference labeling. A separate question is whether this kind of effort can actually produce convincing human mimicry, even with human labelers.