Technically, transformers are an architecture, which is orthogonal to the training setup. However, their main advantage, parallelization over the time dimension, allows a large training speedup and thus training on very large datasets. The largest datasets are generally not annotated and so permit only unsupervised training. Before transformers, SL was the more dominant paradigm, but foundation models are trained with UL on large, internet-scale datasets.
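
To make the parallelization point concrete, here is a minimal NumPy sketch (the shapes, weight names, and function name are illustrative assumptions, not from the original text) of causal self-attention: all time steps are computed in one batched matrix product, whereas an RNN would have to loop over t = 1..T sequentially.

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """x: (T, d) sequence of token embeddings; Wq/Wk/Wv: (d, d) projections."""
    T, d = x.shape
    Q, K, V = x @ Wq, x @ Wk, x @ Wv           # all T positions projected at once
    scores = Q @ K.T / np.sqrt(d)              # (T, T) attention logits
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[mask] = -np.inf                     # each position attends only to the past
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                         # (T, d), computed in parallel over time

rng = np.random.default_rng(0)
T, d = 8, 16
x = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
print(causal_self_attention(x, Wq, Wk, Wv).shape)  # (8, 16)
```

The causal mask is also why raw internet text suffices as training data: the targets for next-token prediction are just the input shifted by one position, so no annotation is needed.
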
Of course, GPT models are pretrained with UL, and then the final training stage uses RLHF.
… and in between, instruction tuning uses SL. So they use all three paradigms.
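
To show how the paradigms relate in practice, here is a toy sketch (the function name, shapes, and mask values are illustrative assumptions, not from the original text): pretraining (UL) and instruction tuning (SL) can share the same next-token cross-entropy loss and differ mainly in the data and the loss mask, while RLHF, omitted here, replaces this loss with a reward-driven objective.

```python
import numpy as np

def next_token_loss(logits, targets, loss_mask):
    """Mean cross-entropy over positions where loss_mask is 1.

    logits: (T, V) unnormalized scores, targets: (T,) token ids,
    loss_mask: (T,) 1 where the position contributes to the loss.
    """
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return (nll * loss_mask).sum() / loss_mask.sum()

rng = np.random.default_rng(0)
T, V = 6, 10
logits = rng.standard_normal((T, V))
targets = rng.integers(0, V, size=T)

# Pretraining (UL): every next-token position is a target.
pretrain_mask = np.ones(T)
# Instruction tuning (SL): only the annotated response span (here, the last
# three tokens) contributes to the loss; the prompt tokens are masked out.
sft_mask = np.array([0, 0, 0, 1, 1, 1])

print(next_token_loss(logits, targets, pretrain_mask))
print(next_token_loss(logits, targets, sft_mask))
```
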