A few answers to the open questions you gave:
3. Sparse attention patterns do not affect the number of parameters. This kind of sparsity is designed to reduce how many cached KV values have to be loaded during decoding, not to reduce the space the model takes up (see the first sketch after this list for a rough sense of the savings).
4. The only difference between encoder and decoder transformers is the attention mask. In an encoder, tokens can attend to future tokens (acausal, bidirectional attention), while in a decoder, tokens cannot attend to future tokens (causal attention); see the second sketch after this list. The term “decoder” is used because decoders can be used to generate text, while encoders cannot (since you can only run an encoder if you know the full input already).
5. Some people do increase the vocabulary size (e.g. AI21’s Jurassic-1 model and BLOOM use 256K-token vocabularies). However, this technique requires the model to read “more text” on average per forward pass (each token covers more characters), and may hurt performance. It has fallen somewhat out of favor.
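To make point 3 concrete, here is a rough back-of-the-envelope sketch (not any particular model’s implementation): it just counts how many cached KV entries a decoder reads over a full generation with dense attention versus a hypothetical sliding-window pattern. The function name, the 4096-token sequence, and the window size of 256 are all made up for illustration; the weight matrices are untouched in either case, so the parameter count is identical.

```python
def decode_kv_reads(seq_len, window=None):
    """Count cached KV entries read across a full decode.

    window=None -> dense attention: step t reads all t cached entries.
    window=W    -> sliding-window (sparse) attention: step t reads at most W.
    Neither case changes the weight matrices, so the parameter count is the same.
    """
    reads = 0
    for t in range(1, seq_len + 1):
        reads += t if window is None else min(t, window)
    return reads

print(decode_kv_reads(4096))              # dense: ~8.4M KV reads
print(decode_kv_reads(4096, window=256))  # windowed: ~1.0M KV reads
```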
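And a minimal NumPy sketch of point 4, assuming single-head scaled dot-product attention; `attention_weights` and the toy shapes are invented for illustration. The only thing that changes between the “encoder” and “decoder” call is whether future positions are masked out.

```python
import numpy as np

def attention_weights(queries, keys, causal):
    """Scaled dot-product attention weights for a single head.

    causal=False (encoder-style): every position may attend to every other.
    causal=True  (decoder-style): attention to future positions is masked out.
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)          # (seq, seq) raw scores
    if causal:
        seq = scores.shape[0]
        future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)  # block attention to the future
    scores -= scores.max(axis=-1, keepdims=True)    # stabilise the softmax
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
q, k = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(np.round(attention_weights(q, k, causal=False), 2))  # dense matrix (encoder)
print(np.round(attention_weights(q, k, causal=True), 2))   # lower-triangular (decoder)
```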
This was very helpful to me. Thank you.