More generally, this suggests a principle: to the extent possible, train values before training competence. That in turn implies it's a mistake to fine-tune language models on human feedback only after they have been pre-trained, because by then they already have concepts like “obvious lie” vs. “non-obvious lie”, and fine-tuning may just push them from preferring the former to the latter. Instead, some of that fine-tuning should happen as early as possible.
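To make "as early as possible" concrete, here is a minimal sketch (PyTorch, with a toy model, synthetic data, and a made-up `feedback_loss` standing in for whatever human-feedback objective is actually used) of what interleaving value training with pretraining could look like: the preference loss is mixed into the pretraining loop from step zero rather than applied to an already-competent model. All names and weights here are hypothetical illustrations, not anyone's actual pipeline.

```python
# Minimal sketch: mix a human-feedback ("values") loss into the pretraining
# ("competence") loop from the very first step, instead of applying it only
# after pretraining has finished. Toy model and random data for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, SEQ = 1000, 64, 32

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)

def lm_loss(model, tokens):
    # Standard next-token prediction: the "competence" objective.
    logits = model(tokens[:, :-1])
    return F.cross_entropy(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))

def feedback_loss(model, preferred, dispreferred):
    # Hypothetical stand-in for a human-feedback objective: push the model
    # to assign higher likelihood to preferred sequences than dispreferred ones.
    def seq_logprob(tokens):
        logits = model(tokens[:, :-1])
        logp = F.log_softmax(logits, dim=-1)
        return logp.gather(-1, tokens[:, 1:, None]).squeeze(-1).sum(-1)
    return -F.logsigmoid(seq_logprob(preferred) - seq_logprob(dispreferred)).mean()

model = TinyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    tokens = torch.randint(0, VOCAB, (8, SEQ))   # pretraining batch
    good = torch.randint(0, VOCAB, (4, SEQ))     # human-preferred batch
    bad = torch.randint(0, VOCAB, (4, SEQ))      # human-dispreferred batch
    # Values are trained alongside competence from step 0, not bolted on later.
    loss = lm_loss(model, tokens) + 0.1 * feedback_loss(model, good, bad)
    opt.zero_grad()
    loss.backward()
    opt.step()
```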
(I think this is a really intriguing hypothesis; strong-upvote)