I’m way more used to thinking about weird maths, distributed algorithms, or abstract philosophical problems than about concrete machine learning architectures. But based on everything I see about GPT-3, it seems like a good idea to learn more about it, even if only to participate in the discussion without spouting nonsense.
So I’m asking: what do you think are the must-reads on GPT-3 specifically, and maybe any prerequisites for understanding them?
nostalgebraist’s blog is a must-read regarding GPT-x, including GPT-3. Perhaps start here (“the transformer… ‘explained’?”), which helps contextualize GPT-x within the history of machine learning.
(Though I should note that nostalgebraist holds a contrarian “bearish” position on GPT-3 in particular; for the “bullish” case, read Gwern instead.)
Thanks for the answer! I knew about the “transformer explained” post, but I was not aware of its author’s position on GPT-3.
Here’s a list of resources that may be of use to you. The GPT-3 paper itself isn’t very specific on implementation details, because the changes that led to it were rather incremental (especially from GPT-2, and more so the farther back we look along the Transformer lineage). So the background needed to understand GPT-3 is broader than one might expect.
https://github.com/jalammar/jalammar.github.io/blob/master/notebooks/nlp/01_Exploring_Word_Embeddings.ipynb (notebook exploring word embeddings)
http://www.peterbloem.nl/blog/transformers Transformers from scratch
http://jalammar.github.io/illustrated-transformer/ The Illustrated Transformer
https://amaarora.github.io/2020/02/18/annotatedGPT2.html The Annotated GPT-2
http://jalammar.github.io/illustrated-gpt2/ The Illustrated GPT-2
http://jalammar.github.io/how-gpt3-works-visualizations-animations/ How GPT-3 Works
https://arxiv.org/pdf/1409.0473.pdf Attention (the original Bahdanau et al. paper)
https://arxiv.org/pdf/1706.03762.pdf Attention Is All You Need
http://nlp.seas.harvard.edu/2018/04/03/attention.html The Annotated Transformer
https://www.arxiv-vanity.com/papers/1904.02679/ Visualizing Attention
https://stats.stackexchange.com/questions/421935/what-exactly-are-keys-queries-and-values-in-attention-mechanisms (keys, queries, and values explained; see the code sketch after this list)
https://arxiv.org/pdf/1807.03819.pdf Universal Transformers
https://arxiv.org/pdf/2007.14062.pdf Big Bird (see appendices)
https://www.reddit.com/r/MachineLearning/comments/hxvts0/d_breaking_the_quadratic_attention_bottleneck_in/ (discussion: breaking the quadratic attention bottleneck)
https://www.tensorflow.org/tutorials/text/transformer (TensorFlow Transformer tutorial)
https://www.tensorflow.org/tutorials/text/nmt_with_attention (TensorFlow NMT-with-attention tutorial)
https://cdn.openai.com/blocksparse/blocksparsepaper.pdf GPU Kernels for Block-Sparse Weights
https://openai.com/blog/block-sparse-gpu-kernels/ (accompanying blog post)
https://github.com/pbloem/former/blob/master/former/transformers.py (Peter Bloem’s minimal implementation)
https://github.com/openai/blocksparse/blob/master/examples/transformer/enwik8.py (block-sparse Transformer example)
https://github.com/google/trax/blob/master/trax/models/transformer.py (Trax implementation)
https://github.com/huggingface/transformers/blob/master/src/transformers/modeling_gpt2.py (Hugging Face’s GPT-2 implementation; a short usage example follows below)
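Since several of the links above (the keys/queries/values answer and “Attention Is All You Need” in particular) revolve around the same core operation, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. The dimensions, weight shapes, and function name are illustrative only, not taken from any GPT config:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention (illustrative sketch).

    X:             (seq_len, d_model) matrix of token embeddings.
    W_q, W_k, W_v: (d_model, d_k) learned projection matrices.
    """
    Q = X @ W_q                      # queries: what each token is looking for
    K = X @ W_k                      # keys: what each token advertises
    V = X @ W_v                      # values: the content that gets mixed
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (seq_len, seq_len) similarity matrix
    # GPT-style decoders would apply a causal mask here (position i may
    # only attend to positions <= i); omitted to keep the sketch minimal.
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V               # each output row: weighted mix of values

# Toy usage with random weights; a trained model learns W_q, W_k, W_v.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))         # 5 tokens, d_model = 16
W_q, W_k, W_v = [rng.normal(size=(16, 8)) for _ in range(3)]
print(self_attention(X, W_q, W_k, W_v).shape)  # (5, 8)
```

Everything else in the Transformer lineage (multiple heads, causal masking, feed-forward blocks, layer norm, and the sheer scale of GPT-3) is layered on top of this one operation.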
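And since the last link is Hugging Face’s GPT-2 implementation, here is roughly the shortest path to actually running it. A sketch, assuming a recent `transformers` install; the “gpt2” checkpoint name and `generate` arguments below are the library’s standard ones, but check the current docs:

```python
# Minimal text generation with the Hugging Face GPT-2 linked above.
# Requires: pip install transformers torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # smallest GPT-2 checkpoint
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("The transformer architecture", return_tensors="pt")
output = model.generate(input_ids, max_length=30, do_sample=True, top_k=50)
print(tokenizer.decode(output[0]))
```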
Thanks! I’ll try to read that.