I think this is an interesting direction and I’ve been thinking about pretty similar things (or more generally, “quotient” interpretability research). I’m planning to write much more in the future, but not sure when that will be, so here are some unorganized quick thoughts in the meantime:
Considering the internal interfaces of a program/neural net/circuit/… is a special case of the more general idea of describing how a program/… works at a higher level of abstraction. For example, for circuits (and in particular neural networks), we could think of the “interface abstraction” as a quotient on vertices. I.e. we partition vertices into submodules, and then throw away all the information about how each submodule computes its outputs, considering only the interfaces. This corresponds to using a quotient graph of the original computational graph. From this perspective, interfaces are one very sensible abstraction of a computational graph, but not the only one. So besides interfaces, I’m also interested in what abstractions of programs/computational graphs can look like more generally.
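To make the quotient-graph picture concrete, here’s a minimal sketch in plain Python (the representation and all names are just mine for illustration, nothing from the post): the computational graph is an adjacency dict, the partition assigns each vertex to a submodule, and the quotient keeps only the edges that cross submodule boundaries, i.e. the interfaces.

```python
# Minimal sketch: the "interface abstraction" as a quotient graph.
# A computational graph is an adjacency dict {vertex: set of successors};
# a partition maps each vertex to a submodule label.

def quotient_graph(graph, partition):
    """Collapse each submodule to one node, keeping only cross-module edges."""
    q = {module: set() for module in set(partition.values())}
    for u, successors in graph.items():
        for v in successors:
            mu, mv = partition[u], partition[v]
            if mu != mv:  # edges inside a submodule are "thrown away"
                q[mu].add(mv)
    return q

# Toy example: two submodules A and B; only the edge x2 -> y1 is an interface edge.
graph = {"x1": {"x2"}, "x2": {"y1"}, "y1": {"y2"}, "y2": set()}
partition = {"x1": "A", "x2": "A", "y1": "B", "y2": "B"}
print(quotient_graph(graph, partition))  # {'A': {'B'}, 'B': set()} (key order may vary)
```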
This also highlights that you can have submodules, and thus interfaces, at different levels of abstraction. In programs, you might have small helper functions, which are composed into more complicated methods, which are part of classes, which are part of larger modules. In a computational graph, you could have refinements of partitions of vertices. I’d consider the examples from section 4 pretty high-level submodules, and I think somewhat lower-level ones would also be interesting.
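As a toy illustration of what “refinements of partitions of vertices” could look like computationally (again just my own sketch with made-up names): a lower-level partition refines a higher-level one if every fine-grained submodule lies entirely inside a single coarse submodule, so you can quotient the fine-grained quotient graph again to get the coarse one.

```python
# Sketch: check that a fine-grained partition refines a coarse one,
# i.e. every fine submodule lies entirely within one coarse submodule.

def refines(fine, coarse):
    """fine and coarse each map every vertex to a submodule label."""
    block_image = {}  # fine block -> the coarse block it must map into
    for v, fine_block in fine.items():
        coarse_block = coarse[v]
        if block_image.setdefault(fine_block, coarse_block) != coarse_block:
            return False  # this fine block straddles two coarse blocks
    return True

# E.g. helper functions vs. the larger modules they belong to:
fine = {"x1": "helper1", "x2": "helper1", "y1": "helper2", "y2": "helper3"}
coarse = {"x1": "moduleA", "x2": "moduleA", "y1": "moduleB", "y2": "moduleB"}
print(refines(fine, coarse))  # True: each helper sits inside a single module
```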
You mention that the interfaces themselves have structure (“data formats”). Perhaps this could be modeled by looking at interfaces at different levels of abstraction, as mentioned in the previous bullet point. I.e. one high-level interface would be made up of several low-level interfaces. This is pure guesswork though, I haven’t tried to work out anything like that yet.
When talking about what “good interfaces” or “good submodules” are, a common approach is to require that interfaces be comparatively sparse. An argument for why this might be desirable is that it makes the abstracted computational graph consisting only of interfaces easier to understand. But if I imagine a high-level description of how a neural network works that’s actually human-understandable, the key aspect seems to be that the high-level description should be in terms of human concepts. This suggests that the important thing is that the information at the interfaces can be well approximated using compact descriptions in terms of human concepts. In some hand-wavy way, it seems that (some versions of) the Natural Abstraction Hypothesis (NAH) should imply that these two desiderata are the same: if we look for submodules as things with sparse interfaces, we also get human-understandable concepts represented at the interfaces. I think formalizing this claim could be a good milestone for conceptual research on modularity/abstractions/...
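For concreteness, the “sparse interfaces” desideratum has one very direct operationalization in the quotient picture above: the fraction of edges that cross submodule boundaries. This is just a sketch of one candidate measure (my own toy formalization, not anything from the post), and it deliberately captures only the sparsity side, not the “expressible in terms of human concepts” side.

```python
# Sketch of one possible "interface sparsity" score for a partition:
# the fraction of edges crossing submodule boundaries. Lower = sparser interfaces.

def interface_sparsity(graph, partition):
    total, crossing = 0, 0
    for u, successors in graph.items():
        for v in successors:
            total += 1
            if partition[u] != partition[v]:
                crossing += 1
    return crossing / total if total else 0.0

graph = {"x1": {"x2"}, "x2": {"y1"}, "y1": {"y2"}, "y2": set()}
good = {"x1": "A", "x2": "A", "y1": "B", "y2": "B"}  # only one crossing edge
bad = {"x1": "A", "x2": "B", "y1": "A", "y2": "B"}   # every edge crosses
print(interface_sparsity(graph, good))  # 0.333...
print(interface_sparsity(graph, bad))   # 1.0
```

Formalizing when low values of something like this coincide with human-understandable concepts showing up at the interfaces is roughly the claim I’d want as a milestone.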