But through gradient descent, shards act upon the neural networks by leaving imprints of themselves, and these imprints have no reason to be concentrated in any one spot of the network (whether activation-space or weight-space).
What does ‘one spot’ mean here?
If you just mean 'a particular entry or set of entries of the weight vector in the standard basis the network is initialised in', then sure, I agree.
But that just means you have to figure out a different representation of the weights, one that carves the logic flow of the learned algorithm at its joints. Such a representation may not have much reason to line up well with any particular neurons, layers, attention heads, or other elements we use to talk about the architecture of the network. That doesn't mean it doesn't exist.
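For intuition, here is a minimal sketch (Python/NumPy, toy linear layers with made-up sizes) of why the standard neuron basis isn't privileged: an orthogonal change of basis on a hidden layer gives a network with identical input-output behaviour whose weights are smeared differently across entries. With nonlinearities only a more restricted family of reparameterisations (e.g. neuron permutations and positive scalings for ReLU) preserves the function exactly, but the moral is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer *linear* network y = W2 @ W1 @ x (hypothetical sizes).
d_in, d_hidden, d_out = 8, 16, 4
W1 = rng.normal(size=(d_hidden, d_in))
W2 = rng.normal(size=(d_out, d_hidden))

# Pick an arbitrary orthogonal change of basis Q for the hidden layer.
Q, _ = np.linalg.qr(rng.normal(size=(d_hidden, d_hidden)))

# Rewriting the weights in the rotated basis leaves the function intact:
# W2 @ W1 == (W2 @ Q.T) @ (Q @ W1).
W1_rot = Q @ W1
W2_rot = W2 @ Q.T

x = rng.normal(size=d_in)
assert np.allclose(W2 @ (W1 @ x), W2_rot @ (W1_rot @ x))

# Same input-output behaviour, but any single weight entry or "neuron"
# in the standard basis now mixes contributions from many old entries.
```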
Nontrivial algorithms run by LLMs require scaffolding, and so aren't really concentrated within the network's internal computation flow. Even something as simple as generating text requires repeatedly feeding sampled tokens back to the network, which means the network has an extra connection from outputs to inputs that is rarely modelled by mechanistic interpretability.
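To make that outputs-to-inputs connection concrete, here is a schematic generation loop (Python; `model_logits` is a hypothetical stand-in for the network's forward pass). The sampling step and the append to the context happen outside the network, so an analysis that only looks inside a single forward pass never sees this edge of the computation.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 50  # made-up toy vocabulary

def model_logits(token_ids):
    # Stand-in for the network's forward pass over the current context;
    # a real LLM would run the transformer here and return next-token logits.
    return rng.normal(size=VOCAB_SIZE)

def generate(prompt_ids, n_new_tokens):
    ids = list(prompt_ids)
    for _ in range(n_new_tokens):
        logits = model_logits(ids)                      # one forward pass
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        next_id = int(rng.choice(VOCAB_SIZE, p=probs))  # sampling: outside the network
        ids.append(next_id)                             # output fed back in as input
    return ids

print(generate([1, 2, 3], 5))
```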