Some hypotheses about the function of these outliers: they could play a large-scale bias or normalization role; they could be ‘empty’ dimensions into which attention heads or MLPs write scratch or garbage values; or they could play some other important role in the network’s computation.
If the outliers were garbage values, wouldn’t that predict that zero ablation doesn’t increase loss much?
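To make the prediction concrete, here is a minimal toy sketch of what a zero-ablation comparison looks like: fabricate residual-stream activations with one artificially large ‘outlier’ dimension, zero that dimension, and compare loss before and after. Everything here (the dimension index, the linear readout, the MSE loss) is a made-up stand-in, not any particular model.

```python
import numpy as np

# Toy residual-stream activations with one hypothetical "outlier" dimension.
rng = np.random.default_rng(0)
d_model = 8
resid = rng.normal(size=(100, d_model))
resid[:, 3] += 10.0                  # dim 3 plays the outlier role (assumption)

# A fake linear readout; targets are defined so the baseline loss is ~0.
w = rng.normal(size=d_model)
targets = resid @ w

def mse_loss(acts: np.ndarray) -> float:
    return float(np.mean((acts @ w - targets) ** 2))

baseline = mse_loss(resid)

# Zero ablation: overwrite the outlier dimension with zeros, keep the rest.
ablated = resid.copy()
ablated[:, 3] = 0.0
ablated_loss = mse_loss(ablated)

print(f"baseline loss: {baseline:.4f}, ablated loss: {ablated_loss:.4f}")
```

If the outlier dimension carried only garbage, `ablated_loss` should stay close to `baseline`; a large jump, as in this deliberately rigged toy, is the signature of a dimension the downstream computation actually reads.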