CRG comments on How does GPT-3 spend its 175B parameters?

CRG 14 Jan 2023 13:13 UTC
2 points
The layernorm does in fact have parameters: each one has a scale vector and a shift vector, both of size d_model. That adds 2 × d_model parameters per layernorm (so 4 × d_model per block, given the two layernorms in a GPT-2-style block), plus an extra 2 × d_model for the final layernorm before the unembedding.

LN(x) = (x-mean(x))/std(x) * scale + shift
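
To make the count concrete, here is a minimal sketch of the formula above together with the parameter arithmetic. It assumes the GPT-3 175B sizes (d_model = 12288, 96 blocks) and two layernorms per block as in GPT-2-style blocks. The function names, the `eps` stabilizer, and the `norms_per_block` argument are my own illustrative choices, not anything from the post.

```python
import math


def layer_norm(x, scale, shift, eps=1e-5):
    """LN(x) = (x - mean(x)) / std(x) * scale + shift, over one vector."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n  # population variance
    # eps is an assumption: real implementations add a small constant
    # under the square root to avoid dividing by zero.
    std = math.sqrt(var + eps)
    return [(v - mean) / std * g + b for v, g, b in zip(x, scale, shift)]


def layernorm_param_count(d_model, n_layers, norms_per_block=2):
    """Total learned layernorm parameters in the whole model."""
    per_norm = 2 * d_model  # one scale vector plus one shift vector
    in_blocks = n_layers * norms_per_block * per_norm
    final_norm = per_norm  # the layernorm before the unembedding
    return in_blocks + final_norm


# Assumed GPT-3 175B hyperparameters.
D_MODEL = 12288
N_LAYERS = 96

print(layernorm_param_count(D_MODEL, N_LAYERS))
```

Passing `norms_per_block=1` gives the count for one layernorm per block instead. Either way the total is a few million parameters, which is negligible next to 175B.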