Cute construction! To check, am I correct that you’re adding an attention head per neuron? To me that makes this prohibitive enough to not actually be useful for real models—e.g., in GPT-2 Small that’d take you from 12 heads per layer to about 3,000 per layer.
That’s right, the activation function sublayer needs 1 attention head per neuron. The other sublayers can get away with fewer: the attention sublayer needs the usual number of heads, and each linear transformation sublayer just needs enough heads to spread the rank of its weight matrix across the heads’ V matrices. I’m most familiar with the size hyperparameters of GPT-3 (Table 2.1), but in full-size GPT-3 the per-sublayer head counts work out to (arithmetic sketched after the list):
- 96 = n_heads heads for the attention sublayer
- 384 = d_ff / d_head heads for the weight matrix calculating into the hidden layer
- 49152 = d_ff heads for the activation function
- 96 = d_model / d_head heads for the weight matrix calculating out of the hidden layer
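For concreteness, here is a minimal sketch of that head-count arithmetic in Python. It only reproduces the counting in the list above, not any part of the construction itself; the GPT-2 Small values (d_model = 768, n_heads = 12) and the convention d_ff = 4·d_model are assumptions on my part rather than anything from the post, while the GPT-3 values are the ones from Table 2.1.

```python
# Head-count arithmetic for the construction, per sublayer.
# Assumes d_ff = 4 * d_model and d_head = d_model / n_heads when not given.

def heads_per_layer(d_model, n_heads, d_ff=None, d_head=None):
    """Return the number of heads each sublayer of the construction needs."""
    d_ff = d_ff if d_ff is not None else 4 * d_model       # conventional MLP width
    d_head = d_head if d_head is not None else d_model // n_heads
    return {
        "attention sublayer": n_heads,                      # unchanged from the original model
        "W_in (into hidden layer)": d_ff // d_head,         # spread rank across V matrices
        "activation function": d_ff,                        # 1 head per neuron
        "W_out (out of hidden layer)": d_model // d_head,   # spread rank across V matrices
    }

for name, kwargs in [
    ("GPT-2 Small", dict(d_model=768, n_heads=12)),                  # assumed standard values
    ("GPT-3 175B", dict(d_model=12288, n_heads=96, d_head=128)),     # Table 2.1 values
]:
    counts = heads_per_layer(**kwargs)
    print(f"{name}: {counts}  (total per layer: {sum(counts.values())})")
```

Running this gives 96 / 384 / 49152 / 96 for full-size GPT-3 as listed above, and for GPT-2 Small the activation-function sublayer alone needs 3072 heads, which is where the "about 3,000 per layer" figure in the question comes from.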