I still don’t follow. As far as I can tell, TL’s center_writing_weights adjusts the writing weights in a way that the pre-LN is invariant to (and that also doesn’t affect the softmax probabilities after the unembed). That means the actual computations of the forward pass are unaffected by this weight modification, up to numerical precision, right? If so, our results in particular should not be affected by the choice of TL vs. HF.
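To make the invariance claim concrete, here is a minimal NumPy sketch. It is not TL's actual implementation: the names `layer_norm`, `W_out`, `b_out`, `h`, and `resid` are hypothetical, and the LayerNorm here has no learned scale or bias. The idea is that centering a writing weight along the d_model axis only shifts each residual vector by a constant in every coordinate, and LayerNorm's mean subtraction removes that shift.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Plain LayerNorm over the last axis, without learned scale/bias (an assumption).
    x = x - x.mean(axis=-1, keepdims=True)
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
d_in, d_model, n_tokens = 8, 16, 4

# Hypothetical weight and bias that write into the residual stream.
W_out = rng.normal(size=(d_in, d_model))
b_out = rng.normal(size=d_model)

h = rng.normal(size=(n_tokens, d_in))         # activations being written
resid = rng.normal(size=(n_tokens, d_model))  # residual stream before the write

# Center along the d_model (output) axis: every row of W_c and b_c has mean zero.
W_c = W_out - W_out.mean(axis=-1, keepdims=True)
b_c = b_out - b_out.mean()

# What the next layer sees after its pre-LN, with and without centering.
out_orig = layer_norm(resid + h @ W_out + b_out)
out_cent = layer_norm(resid + h @ W_c + b_c)

max_diff = np.abs(out_orig - out_cent).max()
print(max_diff)  # expected to be at the level of floating-point noise
```

The residual stream itself does change (by a per-token constant), but everything downstream of a LayerNorm does not, which is why any difference should only show up at the level of floating-point precision.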
Oops, my initial hunch was wrong: I had assumed that centering the writing weights did something extra. I’ve edited my top-level comment. Thanks for pointing out my oversight!