ada comments on How does GPT-3 spend its 175B parameters?

ada 16 Jan 2023 21:55 UTC
An aside the original author may be interested in: there has been some work on getting the cost of attention to scale better than O(n^2) in the context length n, e.g. Sparse Transformers (https://arxiv.org/pdf/1904.10509v1.pdf), which use factorized self-attention to bring this down to roughly O(n√n). I also think of OpenAI's Jukebox, which uses a hierarchical strategy in addition to factorized self-attention for generating tokens, effectively increasing the context window (https://openai.com/blog/jukebox/).
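To make the scaling point concrete, here is a rough sketch of a strided sparse attention pattern in the spirit of the linked paper, compared against dense causal attention. The function names, the choice of n = 1024, and the stride of about sqrt(n) are my own assumptions for illustration; this only counts which query-key pairs are allowed and is not the paper's actual implementation or kernels.

```python
import math
import numpy as np

def dense_causal_mask(n):
    # Standard causal attention: position i may attend to every j <= i.
    return np.tril(np.ones((n, n), dtype=bool))

def strided_sparse_mask(n, stride):
    # Simplified union of two patterns (names and details are illustrative):
    #   local:   the most recent `stride` positions (including i itself)
    #   strided: every position j where (i - j) is a multiple of `stride`
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal = j <= i
    local = (i - j) < stride
    strided = (i - j) % stride == 0
    return causal & (local | strided)

n = 1024
stride = math.isqrt(n)  # assumed stride of about sqrt(n)

dense = int(dense_causal_mask(n).sum())
sparse = int(strided_sparse_mask(n, stride).sum())

print(f"dense causal pairs:   {dense}")
print(f"strided sparse pairs: {sparse}")
print(f"ratio sparse/dense:   {sparse / dense:.3f}")
```

Each row of the sparse mask allows at most about 2 * sqrt(n) keys rather than up to n, which is where the roughly O(n√n) total comes from.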