While it may seem like a “linguistic quirk,” the term “pretraining” emerged to distinguish the initial phase of training a language model on a vast corpus of unlabeled text from the subsequent fine-tuning phase, in which the pretrained model is adapted to a specific task using labeled data. The distinction became crucial because the pretraining step often required significant computational resources and time, while fine-tuning could be comparatively efficient and task-specific.
One of the earliest mentions of this terminology can be found in the 2018 BERT paper:
“There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning.” (Devlin et al., 2018)
The rise of large language models like GPT (Generative Pre-trained Transformer) from OpenAI and their impressive performance on various NLP tasks further solidified the importance of this pretraining paradigm. As these models grew larger and more complex, the pretraining phase became even more resource-intensive and critical to the overall performance of the models.
It’s worth noting that the term was not coined exclusively by the companies building large language models; it emerged from the broader research community working on transfer learning and self-supervised pretraining. However, the prominence of these companies and their large-scale models likely contributed to the widespread adoption of the term “pretraining” in the ML and NLP communities.
Regarding the rationale behind using “pretraining” instead of “training,” it seems to stem from the distinction between the initial, resource-intensive phase of capturing general linguistic knowledge and the subsequent task-specific fine-tuning phase. The term “pretraining” emphasizes the preparatory nature of this initial phase, which is followed by fine-tuning or other task-specific training steps.
So yes, I believe the emergence of the term “pretraining” can be attributed to the paradigm shift in NLP toward transfer learning and self-supervised learning, which made it necessary to clearly separate the general-purpose initial phase from the task-specific training that follows it.