A thing that stands out to me more starkly than the parameter count is that GPT-3 is able to produce human-like text by passing a mere 12K-dimensional vector along its residual stream, through only 96 successive steps of elementary transformations (even if each transformation is parameterized by a lot of weights). No feature that requires more contemplation than that can be computed during inference; any deliberative thought that isn't written out in tokens has to fit within those bounds. The whole context needs to be comprehended within that limit, from the first word to the last.
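To make the constraint concrete, here is a minimal sketch of the shape of the computation: a fixed-width residual stream, one vector per token, that each block can only read from and additively write back to. The 12,288-dimensional stream and 96 blocks are GPT-3's published figures; the toy dimensions and the block internals below are stand-ins for illustration, not the real architecture.

```python
import numpy as np

# GPT-3's published figures: d_model = 12288, n_layers = 96.
# Scaled-down toy values here so the sketch runs instantly; only the
# shapes and the data flow are the point.
D_MODEL, N_LAYERS, SEQ_LEN = 128, 12, 8

rng = np.random.default_rng(0)
# One fixed random "weight" per block, standing in for attention + MLP.
block_weights = [
    rng.standard_normal((D_MODEL, D_MODEL)) / np.sqrt(D_MODEL)
    for _ in range(N_LAYERS)
]

def toy_block(x, w):
    """Stand-in for one transformer block: it reads the residual stream,
    computes an update, and adds it back in. Real blocks differ internally;
    the shape-preserving additive structure is what matters here."""
    return x + np.tanh(x @ w)

# The entire inter-layer "working memory" is one D_MODEL-dim vector per token.
residual_stream = rng.standard_normal((SEQ_LEN, D_MODEL))
for w in block_weights:
    residual_stream = toy_block(residual_stream, w)

print(residual_stream.shape)  # (8, 128): same shape in as out, after every block
```

Everything the model "knows" about the context at a given depth has to be packed into that one per-token vector before the next block runs.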
Perhaps GPT-4's advantage is in having enough layers to also think a bit more about what's going on and carry out more complicated plans without thinking out loud in tokens, capturing in its middle layers the features that a human brain would need multiple recurrent passes to formulate. But the low dimension is still shocking. It wasn't obvious from older NLP work that this order of dimensionality is enough to capture human-level cognition, since 1K-dimensional embeddings are already good for indexing/search and it wasn't clear how much further up the nuance would need to go.