Technically, transformers are an architecture, which is orthogonal to the training setup. However, their main advantage, parallelization over the time dimension, allows a large training speedup and thus training on very large datasets. The largest datasets are generally not annotated and so permit only unsupervised training. Before transformers, SL was the more dominant paradigm, but foundation models are trained with UL on large, internet-scale datasets.
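
To make the parallelization point concrete, here is a minimal NumPy sketch (the shapes, weight names, and function name are illustrative assumptions, not from the original text) of causal self-attention: all time steps are computed in one batched matrix product, whereas an RNN would have to loop over t = 1..T sequentially.

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """x: (T, d) sequence of token embeddings; Wq/Wk/Wv: (d, d) projections."""
    T, d = x.shape
    Q, K, V = x @ Wq, x @ Wk, x @ Wv           # all T positions projected at once
    scores = Q @ K.T / np.sqrt(d)              # (T, T) attention logits
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[mask] = -np.inf                     # each position attends only to the past
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                         # (T, d), computed in parallel over time

rng = np.random.default_rng(0)
T, d = 8, 16
x = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
print(causal_self_attention(x, Wq, Wk, Wv).shape)  # (8, 16)
```

The causal mask is also why raw internet text suffices as training data: the targets for next-token prediction are just the input shifted by one position, so no annotation is needed.
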
Of course, GPT models are pretrained with UL, and then the final training stage uses RLHF.
… and in between, instruction tuning uses SL. So they use all three paradigms.
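
To show how the paradigms relate in practice, here is a toy sketch (the function name, shapes, and mask values are illustrative assumptions, not from the original text): pretraining (UL) and instruction tuning (SL) can share the same next-token cross-entropy loss and differ mainly in the data and the loss mask, while RLHF, omitted here, replaces this loss with a reward-driven objective.

```python
import numpy as np

def next_token_loss(logits, targets, loss_mask):
    """Mean cross-entropy over positions where loss_mask is 1.

    logits: (T, V) unnormalized scores, targets: (T,) token ids,
    loss_mask: (T,) 1 where the position contributes to the loss.
    """
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return (nll * loss_mask).sum() / loss_mask.sum()

rng = np.random.default_rng(0)
T, V = 6, 10
logits = rng.standard_normal((T, V))
targets = rng.integers(0, V, size=T)

# Pretraining (UL): every next-token position is a target.
pretrain_mask = np.ones(T)
# Instruction tuning (SL): only the annotated response span (here, the last
# three tokens) contributes to the loss; the prompt tokens are masked out.
sft_mask = np.array([0, 0, 0, 1, 1, 1])

print(next_token_loss(logits, targets, pretrain_mask))
print(next_token_loss(logits, targets, sft_mask))
```
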