In prior work I’ve done, I’ve found that activations have tails between e^(-x^2) and e^(-x) (typically closer to e^(-x)). As such, they’re probably better modeled as logistic distributions.
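To make that concrete, here is a minimal sketch (my own illustration, not from the work above, and using a synthetic stand-in for real activations) that fits log p(x) ≈ a − c·x^alpha to the empirical tail of a sample; alpha near 2 looks Gaussian, alpha near 1 looks exponential/logistic:

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm, logistic

# Placeholder: swap in real activations along some residual-stream direction.
rng = np.random.default_rng(0)
activations = rng.logistic(size=100_000)  # synthetic stand-in with exponential-ish tails

# Empirical log-density of |x| beyond the 90th percentile.
x = np.abs(activations)
hist, edges = np.histogram(x, bins=200, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
tail = (centers > np.quantile(x, 0.90)) & (hist > 0)

# Fit log p(x) ~ a - c * x**alpha; alpha ~ 2 is Gaussian-like, alpha ~ 1 is exponential-like.
def log_tail(x, a, c, alpha):
    return a - c * x**alpha

(a, c, alpha), _ = curve_fit(log_tail, centers[tail], np.log(hist[tail]), p0=(0.0, 1.0, 1.5))
print(f"fitted tail exponent alpha = {alpha:.2f}")

# Sanity check: compare Gaussian vs logistic fits by total log-likelihood.
for name, dist in [("normal", norm), ("logistic", logistic)]:
    params = dist.fit(activations)
    print(name, "log-likelihood:", dist.logpdf(activations, *params).sum())
```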
That said, different directions in the residual stream have quite different distributions. This depends considerably on how you select directions—I imagine random directions are more Gaussian due to the CLT. (Note that averaging together heavier-tailed distributions takes a very long time to become Gaussian; a toy check of this is sketched below.)
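Here is the toy check (synthetic Laplace samples, nothing to do with real activations): average n exponential-tailed variables and compare the excess kurtosis and an extreme quantile against a Gaussian of matching variance. The bulk looks Gaussian fairly quickly, but the far tail stays heavier for a long time.

```python
import numpy as np
from scipy.stats import norm, kurtosis

rng = np.random.default_rng(0)
n_samples = 1_000_000

for n in [1, 4, 16, 64]:
    # Mean of n iid Laplace variables (exponential tails), standardized to unit variance.
    means = rng.laplace(size=(n_samples, n)).mean(axis=1)
    means /= means.std()

    # Excess kurtosis decays like 3/n, but the extreme quantiles stay heavier than Gaussian.
    q = np.quantile(means, 0.9999)
    print(f"n={n:3d}  excess kurtosis={kurtosis(means):5.2f}  "
          f"99.99% quantile={q:.2f}  (Gaussian: {norm.ppf(0.9999):.2f})")
```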
But if you look at (e.g.) the directions selected by neurons optimized for sparsity, I’ve commonly observed bimodal distributions, heavy skew, etc. My low-confidence guess is that this is primarily because various facts about language have these properties, and exhibiting this structure in the model is an efficient way to capture them.
This is broadly similar to the point made by @Fabien Roger.
See also the Curve Detectors paper for a very narrow example of this (https://distill.pub/2020/circuits/curve-detectors/#dataset-analysis): a straight line on a log-prob plot indicates exponential tails.
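If it’s useful, here is a rough sketch of that kind of plot with synthetic data (I don’t have the paper’s dataset handy): the exponential-tailed sample comes out as a straight line on the log-density axis, while the Gaussian curves downward.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
samples = {
    "gaussian": rng.normal(size=200_000),
    "exponential-tailed (Laplace)": rng.laplace(size=200_000),
}

for label, xs in samples.items():
    hist, edges = np.histogram(xs, bins=200, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    keep = hist > 0
    # On a log-density plot, exponential tails are straight lines; Gaussian tails bend down.
    plt.plot(centers[keep], np.log(hist[keep]), label=label)

plt.xlabel("activation value")
plt.ylabel("log density")
plt.legend()
plt.show()
```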
I believe the phenomenon of neurons often having activation distributions with exponential tails was first informally observed by Brice Menard.
Do you have a reference for the work you’re talking about? I’m currently doing some work that involves fitting curves to activation tails.
Unpublished and not written up. Sorry.