I think these flaws suggest that when we do interpretability, what we really want is to impose some structure on the network. That is, we want to find some set of conditions that could occur in reality, such that if those conditions hold, we know the network satisfies some useful property (such as “usually classifies things correctly”).
The main difficulty with this is that it requires a really good understanding of reality?
There we go!
So, one item on my list of posts to maybe get around to writing at some point is about what’s missing from current work on interpretability, what bottlenecks would need to be addressed to get the kind of interpretability we ideally want for application to alignment, and how True Names in general and natural abstraction specifically fit into the picture.
The OP got about half the picture: current methods mostly don’t have a good ground truth. People use toy environments to work around that, but then we don’t know how well tools will generalize to real-world structures which are certainly more complex and might even be differently complex.
The other half of the picture is: what would a good ground truth for interpretability even look like? And as you say, the answer involves a really good understanding of reality.
Unpacking a bit more: “interpret” is a two-part word. We see a bunch of floating-point numbers in a net, and we interpret them as an inner optimizer, or we interpret them as a representation of a car, or we interpret them as Fourier components of some signal, or … Claim: the ground truth for an interpretability method is a True Name of whatever we’re interpreting the floating-point numbers as. The ground truth for an interpretability method which looks for inner optimizers is, roughly speaking, a True Name of inner optimization. The ground truth for an interpretability method which looks for representations of cars is, roughly speaking, a True Name of cars (which presumably routes through some version of natural abstraction). The reason we have good ground truths for interpretability in various toy problems is that we already know the True Names of all the key things involved in those toy problems, e.g. modular addition and Fourier components.
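To make the toy-problem point concrete, here is a minimal sketch of what "having a ground truth" looks like when the thing being interpreted has a known mathematical definition. Everything in it is an assumption for illustration: the function names `planted_embedding` and `recovered_frequencies` are hypothetical, the modulus and frequencies are arbitrary, and the embedding is built by hand rather than taken from a trained net. The "interpretability method" is just a DFT over the token axis, and we can score it only because we planted the Fourier components ourselves.

```python
import numpy as np

def planted_embedding(p, freqs, noise=0.01, seed=0):
    """Build a (p, 2*len(freqs)) embedding whose columns are cos/sin waves
    at the chosen frequencies, plus small Gaussian noise. The frequencies
    are the ground truth by construction (hypothetical stand-in for a net)."""
    rng = np.random.default_rng(seed)
    tokens = np.arange(p)
    cols = []
    for k in freqs:
        cols.append(np.cos(2 * np.pi * k * tokens / p))
        cols.append(np.sin(2 * np.pi * k * tokens / p))
    emb = np.stack(cols, axis=1)
    return emb + noise * rng.standard_normal(emb.shape)

def recovered_frequencies(emb, n_freqs):
    """Toy 'interpretability method': DFT over the token axis, then pick the
    frequencies with the most total power across embedding dimensions."""
    spectrum = np.fft.rfft(emb, axis=0)        # shape (p//2 + 1, d)
    power = (np.abs(spectrum) ** 2).sum(axis=1)
    power[0] = 0.0                             # ignore the constant component
    top = np.argsort(power)[-n_freqs:]
    return sorted(int(k) for k in top)

p = 113                     # arbitrary prime modulus, chosen for illustration
true_freqs = [3, 17, 42]    # planted ground truth (must be at most p // 2)
emb = planted_embedding(p, true_freqs)
found = recovered_frequencies(emb, len(true_freqs))
print(found == sorted(true_freqs))
```

The check on the last line is only possible because "Fourier component" already has a precise definition. For "inner optimizer" or "car" there is no analogous `true_freqs` to compare against, which is the gap the True Names would have to fill.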