That’s right, the activation function sublayer needs 1 attention head per neuron. The other sublayers can get away with fewer: the attention sublayer needs the usual amount, and each linear transformation sublayer just needs enough heads to spread the rank of its weight matrix across the heads’ V matrices. I’m most familiar with the size hyperparameters of GPT-3 (Table 2.1), so taking full-size GPT-3 as the example, the head counts for each sublayer are (the arithmetic is sketched after the list):
- 96 = n_heads heads for the attention sublayer
- 384 = d_ff / d_head heads for the weight matrix calculating into the hidden layer
- 49152 = d_ff heads for the activation function
- 96 = d_model / d_head heads for the weight matrix calculating out of the hidden layer
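Here is a minimal sketch of that arithmetic, assuming the full-size GPT-3 hyperparameters from Table 2.1 (d_model = 12288, n_heads = 96, d_head = 128, d_ff = 4·d_model = 49152). The variable names are just plain-text stand-ins for the paper’s symbols, and the labels "W_in" / "W_out" for the two weight matrices of the feed-forward sublayer are my own shorthand, not the paper’s.

```python
# Head counts needed to emulate each sublayer, per the list above,
# computed from GPT-3 175B's size hyperparameters (Table 2.1).

d_model = 12288                 # residual stream / embedding width
n_heads = 96                    # attention heads per layer
d_head = d_model // n_heads     # 128 dimensions per head
d_ff = 4 * d_model              # feed-forward hidden layer width: 49152

heads_needed = {
    "attention sublayer": n_heads,                  # the usual 96 heads
    "W_in (into hidden layer)": d_ff // d_head,     # 49152 / 128 = 384
    "activation function": d_ff,                    # one head per neuron = 49152
    "W_out (out of hidden layer)": d_model // d_head,  # 12288 / 128 = 96
}

for sublayer, count in heads_needed.items():
    print(f"{sublayer}: {count} heads")
```

The linear sublayers need only enough heads for their V matrices (each of rank at most d_head) to jointly cover the rank of the weight matrix, which is why W_in and W_out come out to d_ff / d_head and d_model / d_head heads respectively, while the elementwise activation can’t be shared across neurons and costs a full d_ff heads.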