Cute construction! To check, am I correct that you’re adding an attention head per neuron? To me that makes this prohibitive enough to not actually be useful for real models—e.g., in GPT-2 Small that’d take you from 12 heads per layer to about 3,000 per layer.
That’s right, the activation function sublayer needs 1 attention head per neuron. The other sublayers can get away with fewer: the attention sublayer needs the usual number of heads, and each linear transformation sublayer just needs enough heads to spread the rank of its weight matrix across the heads’ V matrices. I’m most familiar with the size hyperparameters of GPT-3 (Table 2.1), but in full-size GPT-3 the per-sublayer head counts work out to (arithmetic sketched after the list):
- 96 = n_heads heads for the attention sublayer
- 384 = d_ff / d_head heads for the weight matrix calculating into the hidden layer
- 49152 = d_ff heads for the activation function
- 96 = d_model / d_head heads for the weight matrix calculating out of the hidden layer
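For concreteness, here is a minimal sketch of that head-count arithmetic in Python. It only reproduces the counting in the list above, not any part of the construction itself; the GPT-2 Small values (d_model = 768, n_heads = 12) and the convention d_ff = 4·d_model are assumptions on my part rather than anything from the post, while the GPT-3 values are the ones from Table 2.1.

```python
# Head-count arithmetic for the construction, per sublayer.
# Assumes d_ff = 4 * d_model and d_head = d_model / n_heads when not given.

def heads_per_layer(d_model, n_heads, d_ff=None, d_head=None):
    """Return the number of heads each sublayer of the construction needs."""
    d_ff = d_ff if d_ff is not None else 4 * d_model       # conventional MLP width
    d_head = d_head if d_head is not None else d_model // n_heads
    return {
        "attention sublayer": n_heads,                      # unchanged from the original model
        "W_in (into hidden layer)": d_ff // d_head,         # spread rank across V matrices
        "activation function": d_ff,                        # 1 head per neuron
        "W_out (out of hidden layer)": d_model // d_head,   # spread rank across V matrices
    }

for name, kwargs in [
    ("GPT-2 Small", dict(d_model=768, n_heads=12)),                  # assumed standard values
    ("GPT-3 175B", dict(d_model=12288, n_heads=96, d_head=128)),     # Table 2.1 values
]:
    counts = heads_per_layer(**kwargs)
    print(f"{name}: {counts}  (total per layer: {sum(counts.values())})")
```

Running this gives 96 / 384 / 49152 / 96 for full-size GPT-3 as listed above, and for GPT-2 Small the activation-function sublayer alone needs 3072 heads, which is where the "about 3,000 per layer" figure in the question comes from.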