More generally, this suggests a principle: to the extent possible, train values before training competence. That in turn implies it's a mistake to fine-tune language models on human feedback only after they have been pre-trained, because by then they already have concepts like “obvious lie” vs. “non-obvious lie”, and fine-tuning may just push them from preferring the former to the latter. Instead, some of that fine-tuning should happen as early as possible.
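To make "as early as possible" concrete, here is a minimal sketch (PyTorch, with a toy model, synthetic data, and a made-up `feedback_loss` standing in for whatever human-feedback objective is actually used) of what interleaving value training with pretraining could look like: the preference loss is mixed into the pretraining loop from step zero rather than applied to an already-competent model. All names and weights here are hypothetical illustrations, not anyone's actual pipeline.

```python
# Minimal sketch: mix a human-feedback ("values") loss into the pretraining
# ("competence") loop from the very first step, instead of applying it only
# after pretraining has finished. Toy model and random data for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, SEQ = 1000, 64, 32

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)

def lm_loss(model, tokens):
    # Standard next-token prediction: the "competence" objective.
    logits = model(tokens[:, :-1])
    return F.cross_entropy(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))

def feedback_loss(model, preferred, dispreferred):
    # Hypothetical stand-in for a human-feedback objective: push the model
    # to assign higher likelihood to preferred sequences than dispreferred ones.
    def seq_logprob(tokens):
        logits = model(tokens[:, :-1])
        logp = F.log_softmax(logits, dim=-1)
        return logp.gather(-1, tokens[:, 1:, None]).squeeze(-1).sum(-1)
    return -F.logsigmoid(seq_logprob(preferred) - seq_logprob(dispreferred)).mean()

model = TinyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    tokens = torch.randint(0, VOCAB, (8, SEQ))   # pretraining batch
    good = torch.randint(0, VOCAB, (4, SEQ))     # human-preferred batch
    bad = torch.randint(0, VOCAB, (4, SEQ))      # human-dispreferred batch
    # Values are trained alongside competence from step 0, not bolted on later.
    loss = lm_loss(model, tokens) + 0.1 * feedback_loss(model, good, bad)
    opt.zero_grad()
    loss.backward()
    opt.step()
```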
(I think this is a really intriguing hypothesis; strong-upvote)