GPT is called a “decoder only” architecture. Would “encoder only” be equally correct? From my reading of the original transformer paper, encoder and decoder blocks are the same except that decoder blocks attend to the final encoder block. Since GPT never attends to any previous block, if anything I feel like the correct term is “encoder only”.
I believe “encoder” refers exclusively to the part of the model that reads in text to generate an internal representation, while “decoder” refers exclusively to the part that takes the representation created by the encoder as input and uses it to predict an output token sequence. Encoder takes in a sequence of raw tokens and transforms them into a sequence of encoded tokens. Decoder takes in a sequence of encoded tokens and transforms them into a new sequence of raw tokens.
It was originally assumed that doing the encoder-decoder conversion could be really important for tasks like translation, but it turns out that just feeding a decoder raw tokens as input and training it on next-token prediction on a large enough corpus gets you a model that can do that anyway.
I believe “encoder” refers exclusively to the part of the model that reads in text to generate an internal representation
Architecturally, I think the big difference is bi-directional (BERT can use future tokens to influence latent features of current tokens) vs. uni-directional (GPT only flows information from past to future). You could totally use the “encoder” to generate text, or the “decoder” to generate latent representations used for another task, though perhaps they’re more suited for their typical roles.
EDIT: Whoops, was wrong in initial version of comment.
I believe “encoder” refers exclusively to the part of the model that reads in text to generate an internal representation, while “decoder” refers exclusively to the part that takes the representation created by the encoder as input and uses it to predict an output token sequence. Encoder takes in a sequence of raw tokens and transforms them into a sequence of encoded tokens. Decoder takes in a sequence of encoded tokens and transforms them into a new sequence of raw tokens.
It was originally assumed that doing the encoder-decoder conversion could be really important for tasks like translation, but it turns out that just feeding a decoder raw tokens as input and training it on next-token prediction on a large enough corpus gets you a model that can do that anyway.
Architecturally, I think the big difference is bi-directional (BERT can use future tokens to influence latent features of current tokens) vs. uni-directional (GPT only flows information from past to future). You could totally use the “encoder” to generate text, or the “decoder” to generate latent representations used for another task, though perhaps they’re more suited for their typical roles.
EDIT: Whoops, was wrong in initial version of comment.