GPT is decoder-only. The part labeled “Not in GPT” is the decoder part.
I think both of these statements are true. Despite this, I think the architecture shown in “Not in GPT” is correct, because (as I understand it) “encoder” and “decoder” are interchangeable unless both are present. That’s what I was trying to get at here:
4. GPT is called a “decoder only” architecture. Would “encoder only” be equally correct? From my reading of the original transformer paper, encoder and decoder blocks are the same except that decoder blocks also attend to the output of the final encoder block. Since GPT has no encoder output to attend to, if anything I feel like the correct term is “encoder only”.
See this comment for more discussion of the terminology.
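To make the comparison concrete, here is a minimal sketch of the one place where the two kinds of block differ in self-attention. In the original transformer paper, the decoder's self-attention is masked so that a position cannot attend to later positions, while the encoder's is not; GPT keeps that mask and drops the attention over encoder output. The function `attention_weights` and the uniform `scores` are hypothetical, chosen for illustration only (single head, no learned projections), not taken from any GPT implementation.

```python
import math

def attention_weights(scores, causal):
    """Row-wise softmax over a square matrix of attention scores.

    With causal=True, position i may only attend to positions j <= i
    (decoder-style masking). With causal=False, every position attends
    to every other position (encoder-style).
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # Masked-out positions get -inf, so they receive zero weight.
        row = [
            scores[i][j] if (not causal or j <= i) else float("-inf")
            for j in range(n)
        ]
        m = max(row)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical uniform scores for a 3-token sequence (assumption for illustration).
scores = [[0.0] * 3 for _ in range(3)]

encoder_style = attention_weights(scores, causal=False)
gpt_style = attention_weights(scores, causal=True)

for name, w in [("encoder-style", encoder_style), ("GPT-style", gpt_style)]:
    print(name)
    for row in w:
        print("  ", [round(x, 3) for x in row])
```

With the mask on, the first token attends only to itself and each later token spreads its weight over itself and earlier tokens. This masking is, as far as I can tell, the usual justification for the “decoder only” label.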