Other people know more than me, but my impression was that the heritage of LLMs was things like ULMFiT (2018), where the goal was not to generate text but rather to do non-generative NLP tasks like sentiment-classification, spam-detection, and so on. Then you (1) do self-supervised “pretraining”, (2) edit/replace the output layer(s) to convert it from “a model that can output token predictions” to “a model that can output text-classifier labels / scores”, (3) fine-tune this new model (especially the newly-added parts) on human-supervised (text, label) pairs. Or something like that.
The word “pretraining” makes more sense than “training” in that context because “training” would incorrectly imply “training the model to do text classification”, i.e. the eventual goal. …And then I guess the term “pretraining” stuck around after it stopped making so much sense.
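To make that recipe concrete, here's a minimal sketch in PyTorch (my own illustration, not ULMFiT's actual implementation — ULMFiT used an AWD-LSTM with gradual unfreezing — and all class/variable names are made up):

```python
# Sketch of the "pretrain, swap the output head, fine-tune" recipe.
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Step (1): a model pretrained with self-supervised next-token prediction."""
    def __init__(self, vocab_size=10_000, d_model=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.LSTM(d_model, d_model, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)  # outputs token predictions

    def forward(self, token_ids):
        hidden, _ = self.encoder(self.embed(token_ids))
        return self.lm_head(hidden)  # (batch, seq, vocab) next-token logits

class TinyClassifier(nn.Module):
    """Steps (2)-(3): keep the pretrained body, replace the output layer with a
    classifier head, then fine-tune on supervised (text, label) pairs."""
    def __init__(self, pretrained_lm, num_labels=2):
        super().__init__()
        self.embed = pretrained_lm.embed        # reuse pretrained weights
        self.encoder = pretrained_lm.encoder    # reuse pretrained weights
        # newly-added, randomly initialised classification head
        self.cls_head = nn.Linear(pretrained_lm.lm_head.in_features, num_labels)

    def forward(self, token_ids):
        hidden, _ = self.encoder(self.embed(token_ids))
        return self.cls_head(hidden[:, -1, :])  # (batch, num_labels) label scores

lm = TinyLM()                       # pretend this was "pretrained" on raw text
clf = TinyClassifier(lm)            # swap the LM head for a classifier head
labels = torch.tensor([1])          # e.g. 1 = spam, 0 = not spam
logits = clf(torch.randint(0, 10_000, (1, 16)))
loss = nn.functional.cross_entropy(logits, labels)  # fine-tuning objective
```

The point being: only the last step is training the model to do the task you actually care about, which is why the first step got the "pre-" prefix.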
Thanks for these points! I think I understand the history of what has happened here better now—and the reasons for my misapprehension. Essentially, what I think happened is
a.) LLM/NLP research has (always?) used 'pretraining', going back at least to the 2017 era, for general training of a model not yet specialised for a particular NLP task (such as NER, syntax parsing, etc.)
b.) The rest of ML mostly used 'training' because, by and large, they didn't do massive unsupervised training on unrelated tasks; i.e. CV just had ImageNet or whatever
c.) In the 2020-2022 period, NLP with transformers went from a fairly niche subfield of ML to memetically dominant, due to the massive success of transformer-based GPT models
d.) This meant both that the NLP community's 'pretraining' terminology spread much more widely (as other subfields took up similar methods), and that I got much more involved in looking at NLP/LLM research than I had been in the past, when I personally focused more on CV and RL. Hence the term's seemingly sudden appearance in my personal experience (an impression which turned out to be wrong).
Yes, the ULMFiT paper is one of the first papers to use the notion of "pretraining" (it might be the one that actually introduced this terminology).
Then it appears in other famous 2018 papers:
Improving Language Understanding by Generative Pre-Training (Radford et al., June 2018)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., October 2018)