I sampled hundreds of short context snippets from OpenWebText and measured ablation effects averaged over those sampled forward passes. Across those hundreds of passes, I didn’t see any real signal in the logit effects, just a layer of noise due to the ablations.
More could definitely be done on this front. I just tried something relatively quickly that fit inside of GPU memory and wanted to report it here.
Epistemic status: Just a confusion I once had, and how I eventually resolved it to my satisfaction.
In ordinary differential equations, separability is a deductive rule stating that whenever you have a differential equation of the form
f(x) = g(y) dy/dx
you can then reason that
f(x)dx=g(y)dy
and then that
∫f(x)dx=∫g(y)dy
From the very first time I saw that, I was immediately put off by that middle equation. What the hell does an expression like f(x)dx (by itself) even mean? Until I saw this, I had figured that, apart from their weird notation, differentiation and integration were just plain old multivariate functions. I had made sense of their notation by just ignoring it, basically. And from that point of view, the above deduction is just nonsensical.
I also remember not getting good clarificatory answers about this at the time! I mostly recall being told to just ignore the middle equation and take the whole conditional on faith, as something that has been separately proven.
Eventually, I learned that there was this idea in math called differential forms which gave a precise-and-everywhere-valid interpretation to the stand-alone expression dx. But you don’t quite need that machinery to resolve the above thing that bothered me.
Did you know that “calculus” is an abridgment of the original term “the infinitesimal calculus”? “The rules for soundly manipulating infinitesimal quantities,” basically. I did not know this when I first encountered this separability thing. There’s a whole saga, maybe even the main story in mathematics, about why that interpretation and corresponding terminology fell out of favor.
The basic infinitesimal calculus idea (which is only sometimes, not always, a valid interpretation of the symbols) is that
∫ [sum of their products over a range]   f(x) [function value]   dx [infinitesimal quantity]
(I very vividly remember the moment when I discovered that the integral sign was just a stylized “S”, for “sum”!) Now you cannot everywhere use the above separability reasoning on the strength of the infinitesimal interpretation. Again, it’s not an everywhere-valid interpretation!
Once you’re using any everywhere-valid interpretation, though (any way of giving dx and ∫ their own independent meanings as symbols), the separability deduction just falls out! If two things are equal, you can multiply both by any mathematical object and get a true equation, so multiplying both sides of the first equation by dx gives the second. It doesn’t matter what kind of mathematical object dx is. If two things are equal, you can apply the same operation to both and get a true equation, so applying ∫ to both sides of the second equation gives the third. It doesn’t matter what the integration (summation) operation amounts to, precisely.
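As a sanity check on the deduction, here is a minimal numerical sketch in Python. Everything specific in it is my own assumption for illustration: the example equation x = y dy/dx (so f(x) = x and g(y) = y), the starting point y(0) = 1, and the helper names riemann_sum and integrate_ode. It treats ∫ quite literally as a sum of "function value times small step" and compares the two sides of the final equation along a numerically computed solution.

```python
def f(x):
    # Left-hand function in f(x) = g(y) dy/dx (illustrative choice).
    return x


def g(y):
    # Right-hand function in f(x) = g(y) dy/dx (illustrative choice).
    return y


def riemann_sum(func, a, b, n=100_000):
    """Midpoint Riemann sum: add up func(value) * step over n small steps."""
    step = (b - a) / n
    return sum(func(a + (i + 0.5) * step) for i in range(n)) * step


def integrate_ode(x0, y0, x1, n=10_000):
    """Follow dy/dx = f(x) / g(y) from (x0, y0) to x1 with classic RK4 steps."""
    h = (x1 - x0) / n
    x, y = x0, y0
    for _ in range(n):
        k1 = f(x) / g(y)
        k2 = f(x + h / 2) / g(y + h * k1 / 2)
        k3 = f(x + h / 2) / g(y + h * k2 / 2)
        k4 = f(x + h) / g(y + h * k3)
        y += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        x += h
    return y


x0, y0, x1 = 0.0, 1.0, 2.0
y1 = integrate_ode(x0, y0, x1)   # where the solution ends up at x = x1

lhs = riemann_sum(f, x0, x1)     # "∫ f(x) dx" over the x range
rhs = riemann_sum(g, y0, y1)     # "∫ g(y) dy" over the matching y range

print(lhs, rhs)
```

For this particular example both sums should come out very close to 2, which is what the separated equation predicts. This only checks one well-behaved case numerically; it is not a substitute for the general argument above.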