I agree that there are many reasons that directions do matter, but clearly distance would matter too in the softmax case!
Also, without layernorm, intermediate components of the network could “care more” about the magnitude of the residual stream (whereas it only matters for the unembed here), while for networks w/ layernorm the intermediate components literally do not have access to magnitude data!
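To make the magnitude point concrete, here is a minimal NumPy sketch. The names `resid` and `W_U`, the sizes, and the scale factor are all made up for illustration. It assumes a plain layernorm with no learned gain/bias and `eps = 0` (real implementations add a small epsilon, so the invariance is only approximate there), and an unembed that reads the residual stream directly.

```python
import numpy as np

def layer_norm(x):
    # Plain layernorm, assuming no learned gain/bias and eps = 0:
    # subtract the mean, then divide by the standard deviation.
    centered = x - x.mean()
    return centered / np.sqrt(centered.var())

def softmax(z):
    # Numerically stable softmax over a 1-D vector of logits.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
resid = rng.normal(size=16)      # hypothetical residual stream vector
W_U = rng.normal(size=(16, 5))   # hypothetical unembedding matrix

scaled = 10.0 * resid            # same direction, 10x the magnitude

# A component reading through layernorm sees the same input either way.
ln_same = np.allclose(layer_norm(resid), layer_norm(scaled))

# Softmax over the unembedded logits does change with the magnitude.
probs_same = np.allclose(softmax(resid @ W_U), softmax(scaled @ W_U))

print("layernorm output unchanged by rescaling:", ln_same)
print("softmax output unchanged by rescaling:", probs_same)
```

Under these assumptions, rescaling the residual stream leaves the layernorm output unchanged but sharpens the softmax distribution, which is the sense in which magnitude is only visible at the unembed.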
Fair. You convinced me that the effect is determined more by layernorm than by cross-entropy.