drocta comments on How Do Selection Theorems Relate To Interpretability?

drocta 10 Jun 2022 0:56 UTC
8 points
As another “why not just” which I’m sure there’s a reason for:
in the original circuits thread, they made a number of parameterized families of synthetic images which certain nodes in the network responded strongly to in a way that varied smoothly with the orientation parameter, and where these nodes detected e.g. boundaries between high-frequency and low-frequency regions at different orientations.
Given another network of generally the same kind of architecture, if you gave it the same images and it also had analogous nodes, I’d expect those nodes to respond to those images much more similarly than any other nodes in the network would. I would expect that cosine similarity of the “how strongly does this node respond to this image” vectors would be able to pick out the node(s) in question fairly well? Perhaps I’m wrong about that.
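A minimal sketch of that cosine-similarity check, assuming each node's responses to the synthetic images have already been recorded as a vector. The function name `cosine_match` and the toy data are made up for illustration; nothing here comes from the circuits thread itself.

```python
import numpy as np

def cosine_match(ref_profile, candidate_profiles, eps=1e-12):
    """Score every candidate node against one understood node.

    ref_profile: shape (n_images,), responses of the understood node.
    candidate_profiles: shape (n_nodes, n_images), responses of each node
        in the other network to the same images, in the same order.
    Returns an array of cosine similarities, one per candidate node.
    """
    ref = np.asarray(ref_profile, dtype=float)
    cand = np.asarray(candidate_profiles, dtype=float)
    norms = np.linalg.norm(cand, axis=1) * np.linalg.norm(ref)
    return cand @ ref / (norms + eps)

# Toy data (hypothetical): 36 images indexed by orientation.
angles = np.linspace(0.0, np.pi, 36, endpoint=False)
ref = np.exp(-((angles - 1.0) ** 2) / 0.1)  # node tuned to one orientation

rng = np.random.default_rng(0)
cand = rng.random((5, angles.size))  # five unrelated nodes
cand[3] = 2.5 * ref + 0.01 * rng.random(angles.size)  # planted analogue

sims = cosine_match(ref, cand)
best = int(np.argmax(sims))  # index of the most similar candidate node
```

Cosine similarity ignores overall scale, so a node that responds to the same images but more strongly still matches. With all-positive activations the baseline similarity between unrelated nodes can be fairly high, so subtracting each node's mean response first might separate them better.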
And, of course, this idea seems only directly applicable to feed-forward convolutional networks that take an image as the input, and so, not so applicable when trying to, like, understand how an agent works, probably.
(Well, maybe it would work in things that aren’t just convolutions-and-pooling-and-dilation-etc., but it seems like it would be hard to make the analogous synthetic inputs which exemplify the sort of thing that the node responds to, for inputs other than images. Especially if the inputs are from a particularly discrete space, like sentences or something.)
But, this makes me a bit unclear about why the “NP-HARD” lights start blinking.
Of course, “find isomorphic structure”, sure.
But, if we have a set of situations which exemplify when a given node does and does not fire (rather, when it activates more and when it activates less) in one network, searching another network for a node that does/doesn’t activate in those same situations, hardly seems NP-hard. Just check all the nodes for whether they do or don’t light up. And then, if you also have similar characterizations for what causes activation in the nodes that came before the given node in the first network, apply the same process with those on the nodes in the second network that come before the nodes that matched the closest.
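The "check all the nodes, then repeat on earlier nodes" procedure could be sketched roughly as follows. This assumes the new network's activations are available as one array per layer, and it treats "nodes that come before" as simply "nodes in earlier layers", which is a simplification. The names `best_match` and `match_backward` are hypothetical.

```python
import numpy as np

def cosine_to_all(profile, layer, eps=1e-12):
    # layer: (n_nodes, n_situations); profile: (n_situations,)
    norms = np.linalg.norm(layer, axis=1) * np.linalg.norm(profile)
    return layer @ profile / (norms + eps)

def best_match(profile, layers, max_layer):
    """Scan every node in layers[0:max_layer]; return (sim, layer, node)."""
    best = (-np.inf, None, None)
    for li in range(max_layer):
        sims = cosine_to_all(profile, layers[li])
        j = int(np.argmax(sims))
        if sims[j] > best[0]:
            best = (float(sims[j]), li, j)
    return best

def match_backward(ref_profiles, layers):
    """ref_profiles: profiles of understood nodes, ordered from the target
    node back toward the input. Each later profile is only searched for in
    layers earlier than the previous match."""
    matches = []
    max_layer = len(layers)
    for profile in ref_profiles:
        sim, li, j = best_match(profile, layers, max_layer)
        if li is None:  # no earlier layers left to search
            break
        matches.append((li, j, sim))
        max_layer = li
    return matches
```

Each step is a plain scan over nodes, so the cost grows with (number of understood nodes) × (number of nodes in the new network) × (number of situations), with no search over subsets.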
I suppose if you want to give an overall score for each combination of “this sub-network of nodes in the new network corresponds to this other network of nodes in the old-and-understood network”, and find the sub-network that gives the best score, then, sure, there could be exponentially many sub-networks to consider. But, if each well-understood node in the old network generally has basically only one plausibly corresponding node in the new network, then this seems like it might not really be an issue in practice?
But, I don’t have any real experience with this kind of thing, and I could be totally off.
- johnswentworth 10 Jun 2022 1:00 UTC
  5 points
  Parent
  You’ve correctly identified most of the problems already. One missing piece: it’s not necessarily node-activations which are the right thing to look at. Even in existing work, there are other ways interpretable information is embedded, e.g. directions in activation space of a bunch of neurons, or rank-one updates to matrices.