Further evidence for what you show in this post is the plot of the kurtosis of the activation vectors with varying dropout. Using kurtosis to measure the presence of a privileged basis was introduced in Privileged Bases in the Transformer Residual Stream. Activations in a non-privileged basis should come from a distribution with kurtosis ~3 (the value for an isotropic Gaussian).
With higher dropout you also get higher kurtosis, which implies a heavier-tailed distribution of activations.
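For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of how I'd compute it. It assumes kurtosis is taken across the coordinates of each activation vector and then averaged over vectors, which may not match exactly how the plot was produced. The function name `activation_kurtosis` and the synthetic data are hypothetical stand-ins, not real model activations.

```python
import numpy as np

def activation_kurtosis(acts: np.ndarray) -> float:
    """Mean per-vector kurtosis (non-excess, so a Gaussian gives ~3).

    acts has shape (n_vectors, d_model). Kurtosis is computed across the
    d_model coordinates of each vector, then averaged over vectors.
    """
    centered = acts - acts.mean(axis=1, keepdims=True)
    m2 = (centered ** 2).mean(axis=1)  # second central moment per vector
    m4 = (centered ** 4).mean(axis=1)  # fourth central moment per vector
    return float((m4 / m2 ** 2).mean())

rng = np.random.default_rng(0)
n, d = 2000, 512

# Assumption: isotropic Gaussian data stands in for a non-privileged basis.
gaussian_acts = rng.standard_normal((n, d))

# Assumption: a few much larger coordinates stand in for a privileged basis.
outlier_acts = gaussian_acts.copy()
outlier_acts[:, :4] *= 10.0

k_gauss = activation_kurtosis(gaussian_acts)
k_outlier = activation_kurtosis(outlier_acts)
print(f"gaussian: {k_gauss:.2f}, outlier dims: {k_outlier:.2f}")
```

The Gaussian case should land near 3, while the version with a few large coordinates should come out far higher. `scipy.stats.kurtosis(acts, axis=1, fisher=False)` should give the same per-vector values if you'd rather not hand-roll the moments.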