> Assume that pre-training has produced a model that optimizes for the pre-training loss and is one of the above types.
As you note, this is an important assumption for the argument, and I think it's likely false, at least for self-supervised pre-training tasks. I don't think LLMs, for example, are well described as "optimizing for" low perplexity at inference time. It's not even clear to me what that would mean, since there is no ground-truth next token during autoregressive generation, so "low perplexity" is not defined there. Rather, SGD simply produces a bundle of heuristics defining a probability distribution that matches the empirical distribution of human text quite well.
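To make the "not defined" point concrete, here's a toy sketch (random logits standing in for an LLM's outputs; everything here is illustrative, not a real model): perplexity is the exponentiated mean negative log-likelihood of *given* reference tokens, so it is computable against a held-out corpus but there is nothing analogous to score when the model samples its own continuation.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 50, 10
logits = torch.randn(seq_len, vocab_size)  # stand-in for a model's next-token logits

# Training / evaluation: ground-truth next tokens exist, so perplexity is defined.
reference_tokens = torch.randint(0, vocab_size, (seq_len,))
nll = F.cross_entropy(logits, reference_tokens)  # mean negative log-likelihood of the references
perplexity = torch.exp(nll)
print(f"perplexity against reference text: {perplexity.item():.2f}")

# Autoregressive generation: the model samples its own continuation, so there is
# no ground-truth token to score against; "perplexity on its own output" would
# just reflect the sharpness of its sampling distribution, not fit to any target.
sampled_tokens = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1).squeeze(-1)
```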
I do think your argument may apply to cases where you pre-train on an RL task and fine-tune on another one, although even there it's unclear.