tailcalled comments on Why I’m bearish on mechanistic interpretability: the shards are not in the network

tailcalled 13 Sep 2024 21:57 UTC
3 points
It’s clearer to me that the structure of the world is centered on emanations, erosions, bifurcations and accumulations branching out from the sun than that these phenomena can be modelled purely as resource-flows. Really, even “from the sun” is somewhat secondary; I originally came to this line of thought while statistically modelling software performance problems, leading to a model I call “linear diffusion of sparse lognormals”.

I can imagine setting up a prompt that makes the network represent things in this format, at least in some fragments of it. However, that’s not what you need in order to interpret the network: it isn’t how people use the network in practice, so it wouldn’t be informative about how the network works.

Instead, an interpretation of the network would be constituted by a map which shows how different branches of the world impacted the network. In the simplest form, you could imagine slicing up the world into categories (e.g. plants, animals, fungi) and then decomposing the weight vector of the network into a sum of the parts due to plants, due to animals, and due to fungi (and presumably also interaction terms and such).
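To make the "weights as a sum over categories" idea concrete, here is a toy sketch. It is not the method proposed above and not something that would scale to an LLM; it only illustrates that such a decomposition is well-defined. It assumes a linear model trained by full-batch gradient descent from zero weights, so the final weight vector is exactly the sum of all update steps, and each step splits into per-category pieces. The function name `train_with_attribution` and the synthetic data are hypothetical. Note that each category's share depends on the training path, so interactions between categories are folded into the shares rather than separated out.

```python
import numpy as np

def train_with_attribution(X, y, categories, lr=0.05, epochs=200):
    """Gradient descent on squared error for a linear model (toy assumption),
    tracking how much of the weight change each data category contributed."""
    n, d = X.shape
    cats = np.array(categories)
    labels = sorted(set(categories))
    w = np.zeros(d)  # start at zero so w equals the sum of all contributions
    contrib = {c: np.zeros(d) for c in labels}
    for _ in range(epochs):
        residual = X @ w - y                      # shape (n,)
        per_example_grad = residual[:, None] * X  # shape (n, d)
        # Split this step's update by the category of each training example.
        steps = {c: -lr * per_example_grad[cats == c].sum(axis=0) / n
                 for c in labels}
        for c in labels:
            contrib[c] += steps[c]
        w = w + sum(steps.values())
    return w, contrib

# Synthetic data with made-up category labels, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0])
categories = ["plants"] * 30 + ["animals"] * 30 + ["fungi"] * 30

w, contrib = train_with_attribution(X, y, categories)
for name, part in contrib.items():
    print(name, np.round(part, 3))
```

Summing the three printed vectors recovers `w` up to floating-point error, which is the sense in which the weight vector decomposes "due to plants, due to animals, and due to fungi" in this toy setting.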

Of course in practice people use LLMs in a pretty narrow range of scenarios that don’t really match plants/animals/fungi, and the training data is probably heavily skewed towards the animals (and especially humans) branch of this tree, so realistically you’d need some more pragmatic model.