Zach Furman comments on Against Almost Every Theory of Impact of Interpretability

Zach Furman 21 Aug 2023 17:35 UTC
1 point
0
if the distribution of intermediate neurons shifts so that Othello-board-state-detectors have a reasonably high probability of being instantiated
Yeah, this “if” was the part I was claiming permutation invariance causes problems for—that identically distributed neurons probably couldn’t express something as complicated as a board-state-detector. As soon as that’s true (plus assuming the board-state-detector is implemented linearly), agreed, you can recover it with a linear probe regardless of permutation-invariance.
This is a more reasonable objection(although actually, I’m not sure if independence does hold in the tensor programs framework—probably?)
I probably should’ve just gone with that one, since the independence barrier is the one I usually think about, and harder to get around (related to non-free-field theories, perturbation theory, etc).
My impression from reading through one of the tensor program papers a while back was that it still makes the IID assumption, but there could be some subtlety about that I missed.