5. Why not increase the context length and vocabulary size? The parameter costs of a larger context window and vocabulary size would be trivial. I could imagine that the vocabulary is already “good enough” and doesn’t need improving, but why not increase context length? Could this be related to the “compute costs” I’ve heard rumors of?
Yes, the self-attention compute should scale quadratically with the context window. The number of parameters does not scale at all with the context window (except for the positional embeddings, I guess, but in other transformers those aren’t trained and use no parameters at all).
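To make that concrete, here is a rough back-of-the-envelope sketch for a single self-attention layer. The function name, the dictionary keys, and the simplifications (one layer, learned positional embeddings, multiply-adds only, no feed-forward block) are my own assumptions for illustration, not anything taken from a particular model.

```python
def attention_cost(n_ctx: int, d_model: int) -> dict:
    """Rough counts for one self-attention layer (illustrative assumptions only)."""
    # Q, K, V and output projections: four d_model x d_model weight matrices.
    # These do not depend on the context length at all.
    proj_params = 4 * d_model * d_model

    # Learned positional embeddings (assumed here): one d_model vector per position,
    # so this is the only parameter count that grows with the context window.
    pos_params = n_ctx * d_model

    # Attention scores (Q K^T) plus the weighted sum over V:
    # each is roughly n_ctx * n_ctx * d_model multiply-adds, hence quadratic in n_ctx.
    attn_macs = 2 * n_ctx * n_ctx * d_model

    # Applying the four projections to every token is only linear in n_ctx.
    proj_macs = n_ctx * proj_params

    return {
        "proj_params": proj_params,
        "pos_params": pos_params,
        "attn_macs": attn_macs,
        "proj_macs": proj_macs,
    }


if __name__ == "__main__":
    for n_ctx in (1024, 2048, 4096):
        print(n_ctx, attention_cost(n_ctx, d_model=768))
```

Doubling `n_ctx` in this sketch quadruples `attn_macs`, doubles `pos_params` and `proj_macs`, and leaves `proj_params` unchanged.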
An aside the original author may be interested in: there has been some work on reducing how compute scales with the context window to below O(n^2), e.g. https://arxiv.org/pdf/1904.10509v1.pdf. I’m also thinking of OpenAI’s Jukebox, which uses a hierarchical strategy in addition to factorized self-attention when generating tokens to effectively increase the context window (https://openai.com/blog/jukebox/).
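As a rough illustration of how a sparse pattern gets below O(n^2), here is a sketch that counts how many (query, key) pairs get attended to under a strided pattern, loosely modeled on the one in the linked paper. This is my own simplification: the function names are hypothetical, and I merge the "recent positions" and "every stride-th position" patterns into one mask rather than splitting them across heads or layers.

```python
def dense_causal_pairs(n_ctx: int) -> int:
    """Number of (query, key) pairs in ordinary causal attention: O(n^2)."""
    return n_ctx * (n_ctx + 1) // 2


def strided_attended_pairs(n_ctx: int, stride: int) -> int:
    """Count pairs where query i attends to key j <= i under a strided pattern.

    Position i attends to j if j is among the most recent `stride` positions,
    or if j lies a multiple of `stride` steps behind i.
    """
    count = 0
    for i in range(n_ctx):
        for j in range(i + 1):
            is_local = (i - j) < stride
            is_strided = (i - j) % stride == 0
            if is_local or is_strided:
                count += 1
    return count


if __name__ == "__main__":
    n_ctx, stride = 1024, 32  # stride near sqrt(n_ctx), an assumed choice
    print("dense:  ", dense_causal_pairs(n_ctx))
    print("strided:", strided_attended_pairs(n_ctx, stride))
```

With the stride chosen near sqrt(n), each query attends to on the order of sqrt(n) keys, so the total work grows roughly like n * sqrt(n) rather than n^2. The counting loop itself is quadratic and is only there to make the pattern explicit; a real implementation would never materialize the full mask this way.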