Is there an explanation of how it works somewhere?
I haven’t seen a writeup anywhere of how it was trained.
Neither have I. I vaguely recall a call for volunteers on the Slack very early on to crowdsource instruction-following prompts and completions, and I speculate this might be the origin: the instruction series may simply be a model finetuned on a small corpus of hand-corrected or hand-written demonstrations of ‘following instructions’. If there’s any use of the fancy RL or preference-learning work, they haven’t mentioned it anywhere that I’ve seen. (In the most recent finetuning paper, none of the examples look like generic ‘instructions’.)