We used TL to cache activations for all experiments, but are considering moving away from it to improve memory efficiency.
TL removes the mean from all additions to the residual stream, which I would have guessed would solve the problem here.
Oh, somehow I’m not familiar with this. Is this center_unembed? Or are you talking about something else?
Do you have evidence for this?
Yes, but I think the evidence didn’t actually come from the “Love”—“Hate” prompt pair. Early in testing we found paired activation additions worked better. I don’t have a citeable experiment off-the-cuff, though.
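To make the "naive bias" idea concrete, here is a toy numpy sketch (all directions here are made up for illustration, not real model activations): if every cached activation carried a large shared bias-like component, a paired difference would cancel it exactly, while a single activation addition would drag it along.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8

# Hypothetical shared component: suppose every cached activation carries a
# large bias-like direction in addition to its prompt-specific content.
shared = 5.0 * rng.normal(size=d_model)
love_part = rng.normal(size=d_model)   # made-up "Love" content direction
hate_part = rng.normal(size=d_model)   # made-up "Hate" content direction

act_love = shared + love_part
act_hate = shared + hate_part

# A single activation addition injects the shared component too;
# the paired difference cancels it exactly.
paired = act_love - act_hate
assert np.allclose(paired, love_part - hate_part)
```

This only shows why a difference *would* help if such a shared component existed; as discussed in this thread, TL's centering should already remove a naive mean-style bias, so whatever makes paired additions work better is presumably something else.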
This is turned on by default in TL, so I think there must be something else weird about the models, rather than just a naive bias, that causes you to need to do the difference thing.
I still don’t follow. Apparently, TL’s center_writing_weights adjusts the writing weights in a pre-LN-invariant fashion (and also in a way which doesn’t affect the softmax probabilities after unembed). This means the actual computations of the forward pass are left unaffected by this weight modification, up to precision limitations, right? So our results in particular should not be affected by TL vs HF.
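A quick numpy check of the pre-LN-invariance claim (toy dimensions, plain LayerNorm with no learned scale/bias): a write into the residual stream and its mean-centered version differ only by a multiple of the all-ones vector, which LayerNorm's mean subtraction removes, so the post-LN activations are identical.

```python
import numpy as np

def layernorm(x, eps=1e-6):
    # LayerNorm over the last (d_model) axis, without learned scale/bias
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
d_model = 16
resid = rng.normal(size=d_model)       # toy residual-stream state
write = rng.normal(size=d_model)       # a component's write into the stream
centered = write - write.mean()        # the same write, mean-centered

# The two writes differ by write.mean() * ones, which LayerNorm removes,
# so the downstream (post-LN) activations are identical.
out_raw = layernorm(resid + write)
out_centered = layernorm(resid + centered)
assert np.allclose(out_raw, out_centered, atol=1e-5)
```

This is the sense in which centering the writing weights is a pure reparameterization: anything downstream of a LayerNorm sees the same values either way, up to floating-point precision.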
Oops, I was wrong in my initial hunch as I assumed centering writing did something extra. I’ve edited my top level comment, thanks for pointing out my oversight!
No, this isn’t about center_unembed; it’s about center_writing_weights, as explained here: https://github.com/neelnanda-io/TransformerLens/blob/main/further_comments.md#centering-writing-weights-center_writing_weight