The way I (a computer scientist who dabbles in physics, so YMMV, I might be wrong) understand the physics here:
- Feynman diagrams are basically a Taylor expansion of a physical system in terms of the strength of some interaction,
- To avoid using these Taylor expansions for everything, one tries to modify the parameters of the model to take a summary of the effects into account; for instance one distinguishes between the “bare mass”, which doesn’t take various interactions into account, and the “effective mass”, which does (see the schematic sketch after this list),
- Sometimes e.g. the Taylor series don’t converge (or some integrals people derived from the Taylor expansions don’t converge), but you know what the summary parameters turn out to be in the real world, and so you can just pretend the calculations converge to whatever gives the right summary parameters (which makes sense if we understand that the model is just an approximation given what’s known, and that at some point it breaks down).
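To make the second point concrete, here is a schematic version of that bookkeeping (a cartoon of the usual perturbative setup, not an actual QED calculation):

```latex
% An observable A expanded in powers of the coupling g; the n-th order term
% A_n collects the Feynman diagrams with n interaction vertices:
\[
  A(g) = A_0 + g\,A_1 + g^2 A_2 + \cdots
\]

% Renormalization, schematically: the interaction corrections \delta m(g)
% (which may individually diverge) are absorbed into the parameter you
% actually measure, and predictions get re-expressed in terms of it:
\[
  m_{\text{eff}} = m_{\text{bare}} + \delta m(g)
\]
```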
Meanwhile, for ML:
- Causal scrubbing is pretty related to Taylor expansions, which makes it pretty related to Feynman diagrams (a toy sketch of its resample-and-patch step follows this list),
- However, it lacks any model for the non-interaction/non-Taylor-expanded effects, and so there are no parameters that these Taylor expansions can be “absorbed into”,
- While Taylor expansions can obviously provide infinite detail, nobody has yet produced any calculations for causal scrubbing that fail to converge rather than simply being unreasonably complicated. This is partly because, without the model above, there aren’t many calculations worth running.
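For concreteness, here is a minimal sketch of the resample-and-patch step that causal scrubbing is built around, on a made-up toy model. The ToyNet architecture, the choice of hook point, and the “behaviour barely moves” check are my own illustration; the real algorithm works from an explicit hypothesis about which computational paths matter and resamples everything the hypothesis calls irrelevant.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for the network under study (made up for illustration).
class ToyNet(nn.Module):
    def __init__(self, d=16):
        super().__init__()
        self.layer1 = nn.Linear(d, d)
        self.layer2 = nn.Linear(d, d)
        self.head = nn.Linear(d, 2)

    def forward(self, x, patch=None):
        h = torch.relu(self.layer1(x))
        h = torch.relu(self.layer2(h))
        if patch is not None:
            h = patch  # overwrite the activation the hypothesis says is irrelevant
        return self.head(h)

net = ToyNet()
x = torch.randn(64, 16)            # inputs whose behaviour we want to explain
x_resampled = torch.randn(64, 16)  # replacement inputs drawn from the same distribution

with torch.no_grad():
    baseline = net(x)
    # Recompute the hooked activation from the resampled inputs, then patch it in.
    h_resampled = torch.relu(net.layer2(torch.relu(net.layer1(x_resampled))))
    scrubbed = net(x, patch=h_resampled)

# If the patched activation really were irrelevant to the behaviour, this is ~0.
print("mean |change in output|:", (scrubbed - baseline).abs().mean().item())
```

Note what’s missing relative to the physics story: there’s no effective parameter for the scrubbed-out part to be absorbed into; all you can do is check how much the behaviour moved.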
I’ve been thinking about various ideas for Taylor expansions and approximations for neural networks, but I kept running in circles, and the main issue I’ve ended up with is this:
In order to eliminate noise, we need to decide what really matters and what doesn’t. However, purely from within the network, we have no principled way of doing so. The closest we get is what affects the network’s token predictions, but even that contains too many unimportant details, because if e.g. the network goes off on a tangent but then returns to the main topic, maybe that tangent didn’t matter and we’re fine with the approximation discarding it.
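To make “what affects the token predictions” concrete, a per-position importance metric might look something like this (toy numbers; the KL-on-next-token-logits metric is my own illustration of the general idea, not something from the causal scrubbing write-ups):

```python
import torch
import torch.nn.functional as F

# How much does ablating some component shift the next-token distribution?
# In practice logits_clean / logits_ablated would come from running the model
# with and without the component patched out; here they are made-up numbers.
logits_clean = torch.tensor([2.0, 0.5, -1.0, 0.1])
logits_ablated = torch.tensor([1.8, 0.6, -0.9, 0.1])

importance = F.kl_div(
    F.log_softmax(logits_ablated, dim=-1),  # "input": log-probs after ablation
    F.log_softmax(logits_clean, dim=-1),    # "target": log-probs of the clean run
    log_target=True,
    reduction="sum",
)
print(importance.item())  # small KL ≈ "didn't matter much for this token"
```

The trouble is that this is per-token: a component can matter a lot for tokens on a throwaway tangent and not at all for where the text eventually ends up, and the metric can’t tell those cases apart.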
As a simplified version of this objection, consider that the token probabilities are not the final output of the network; instead the tokens are sampled and fed back into the network, which means that the final layer is really connected back to the first layer through a non-differentiable function. (The non-differentiability interferes with any interpretability method based on derivatives.)
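A minimal illustration of that last point, with an embedding/head pair standing in for a language model (the specific modules are made up; the point is only that the sampled token id carries no gradient history):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for a language model: embed a token id, produce next-token logits.
vocab, d = 10, 8
embed = nn.Embedding(vocab, d)
lm_head = nn.Linear(d, vocab)

tok = torch.tensor([3])
logits = lm_head(embed(tok))            # step 1: differentiable so far
probs = torch.softmax(logits, dim=-1)

# Sampling is where differentiability dies: the result is a plain integer id.
next_tok = torch.multinomial(probs, 1).squeeze(0)
next_logits = lm_head(embed(next_tok))  # step 2: the token is fed back in

loss = next_logits.sum()
# Ask autograd for d(loss)/d(step-1 logits): the sampled id has no grad_fn,
# so step 1 simply isn't in the backward graph of anything downstream.
(grad_wrt_step1,) = torch.autograd.grad(loss, logits, allow_unused=True)
print(grad_wrt_step1)                   # None: no gradient path through sampling
```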
What we really want to know is the impacts of the network in real-world scenarios, but it’s hard to notice the main consequences of the network, and even if we could, it’s hard to set up measurable toy models of them. And once we had such toy models, it’s unclear whether we’d even need elaborate techniques for interpreting them. If, for instance, Claude is breaking a generation of young nerds by praising any nonsensical thing they say with “Very insightful!”, that doesn’t really need any advanced interpretability techniques to be understood.