We performed a blind pairwise comparison between text-davinci-003 and Alpaca 7B, and we found that these two models perform very similarly: Alpaca wins 90 comparisons against text-davinci-003's 89.
Interestingly, Alpaca is trained using supervised finetuning, not RLHF (text-davinci-003 is trained using RLHF). This seems to confirm my suspicion that while RLHF improves performance, it is not essential.
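A quick sanity check on that 90-vs-89 split: a two-sided sign test (my own sketch, not the authors' evaluation code) confirms the result is statistically indistinguishable from a coin flip between the two models.

```python
# Hedged sketch: a two-sided sign test on the reported blind pairwise
# comparison (Alpaca: 90 wins, text-davinci-003: 89 wins), under the
# null hypothesis that each model is equally likely to win a comparison.
from math import comb

def two_sided_sign_test(wins_a: int, wins_b: int) -> float:
    """P-value for H0: P(A wins a given comparison) = 0.5, ignoring ties."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # Probability of a split at least this lopsided, summed over one tail,
    # then doubled for the two-sided test.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

p = two_sided_sign_test(90, 89)
print(f"n=179, p={p:.3f}")  # prints "n=179, p=1.000"
```

With p = 1.0, a 90/89 split gives no evidence that either model is better, which is exactly the "very similar performance" claim above.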
And so it begins: LLM-generated datasets useful for training LLMs, datasets that wouldn't be found in the wild and would have been (prohibitively) expensive to generate purposefully with human labor. Hopefully the currently human-generated datasets used in SSL pre-training, the backbone of simulacrum alignment, won't be mostly replaced by synthetic datasets that drift away from humanity.
Damn, that’s something I had been worrying about recently.
Eliezer said:
https://twitter.com/ESYudkowsky/status/1635577836525469697
Interesting video on the topic: The Model That Changes Everything: Alpaca Breakthrough (ft. Apple’s LLM, BritGPT, Ernie and AlexaTM)