That’s right, the activation function sublayer needs 1 attention head per neuron. The other sublayers can get away with fewer: the attention sublayer needs the usual amount, and each linear transformation sublayer just needs enough heads to spread the rank of its weight matrix across the heads’ V matrices. I’m most familiar with the size hyperparameters of GPT-3 (Table 2.1), so taking full-size GPT-3 as the example, the head counts for each sublayer are (the arithmetic is sketched after the list):
- 96 = n_heads heads for the attention sublayer
- 384 = d_ff / d_head heads for the weight matrix calculating into the hidden layer
- 49152 = d_ff heads for the activation function
- 96 = d_model / d_head heads for the weight matrix calculating out of the hidden layer
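Here is a minimal sketch of that arithmetic, assuming the full-size GPT-3 hyperparameters from Table 2.1 (d_model = 12288, n_heads = 96, d_head = 128, d_ff = 4·d_model = 49152). The variable names are just plain-text stand-ins for the paper’s symbols, and the labels "W_in" / "W_out" for the two weight matrices of the feed-forward sublayer are my own shorthand, not the paper’s.

```python
# Head counts needed to emulate each sublayer, per the list above,
# computed from GPT-3 175B's size hyperparameters (Table 2.1).

d_model = 12288                 # residual stream / embedding width
n_heads = 96                    # attention heads per layer
d_head = d_model // n_heads     # 128 dimensions per head
d_ff = 4 * d_model              # feed-forward hidden layer width: 49152

heads_needed = {
    "attention sublayer": n_heads,                  # the usual 96 heads
    "W_in (into hidden layer)": d_ff // d_head,     # 49152 / 128 = 384
    "activation function": d_ff,                    # one head per neuron = 49152
    "W_out (out of hidden layer)": d_model // d_head,  # 12288 / 128 = 96
}

for sublayer, count in heads_needed.items():
    print(f"{sublayer}: {count} heads")
```

The linear sublayers need only enough heads for their V matrices (each of rank at most d_head) to jointly cover the rank of the weight matrix, which is why W_in and W_out come out to d_ff / d_head and d_model / d_head heads respectively, while the elementwise activation can’t be shared across neurons and costs a full d_ff heads.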