Lee Sharkey comments on Interpreting Neural Networks through the Polytope Lens

Lee Sharkey 27 Sep 2022 19:09 UTC
LW: 4 AF: 2
This is one of the major research questions we will need to answer before polytopes can be really useful in mechanistic descriptions.

By choosing clustering rather than dimensionality reduction methods, we took a non-decompositional approach here. Clustering was motivated primarily by wanting to capture the monosemanticity of local regions in neural networks. But the ‘monosemanticity’ I’m talking about here means that a small region of activation space means one thing on one level of abstraction, and this ‘one thing’ could be a combination of features. It doesn’t mean that small regions of activation space represent only one feature on a lower level of abstraction. A small region of activation space (e.g. a group of nearby polytopes) might therefore exhibit multiple features on a particular level of abstraction, and clustering isn’t going to help us break that level of abstraction apart into its composite features.
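To make ‘a group of nearby polytopes’ concrete, here is a minimal sketch, assuming a toy two-layer ReLU MLP with random weights. The helper `spline_codes`, the layer sizes, and the use of Hamming distance between codes as the notion of ‘nearby’ are all illustrative assumptions, not necessarily what we did in the post.

```python
import numpy as np

def spline_codes(x, weights, biases):
    """Return the binary on/off pattern of every hidden ReLU for each input row."""
    codes = []
    h = x
    for W, b in zip(weights, biases):
        pre = h @ W + b
        codes.append(pre > 0)        # which neurons are active in this layer
        h = np.maximum(pre, 0.0)     # ReLU
    return np.concatenate(codes, axis=1).astype(np.uint8)

# Hypothetical toy network: 2 inputs, two hidden layers of 8 ReLUs each.
rng = np.random.default_rng(0)
sizes = [2, 8, 8]
weights = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=n) for n in sizes[1:]]

x = rng.uniform(-1, 1, size=(500, 2))
codes = spline_codes(x, weights, biases)

# Inputs with identical spline codes lie in the same polytope.
unique_codes, polytope_id = np.unique(codes, axis=0, return_inverse=True)
polytope_id = polytope_id.reshape(-1)

# Assumed notion of 'nearby': number of neurons whose on/off state differs.
hamming = (unique_codes[:, None, :] != unique_codes[None, :, :]).sum(axis=2)
```

Clustering on a matrix like `hamming` groups polytopes together, but it only assigns each polytope a cluster label; it doesn’t say which features the cluster is composed of.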

Instead of clustering, it seems like it should be possible to find directions in spline code space rather than in activation space. Spline codes can incorporate information about the pathway taken by activations through multiple layers, so spline-code-directions roughly correspond to ‘directions in pathway-space’. If directions in pathway-space don’t interact with each other (i.e. a neuron that’s involved in one direction in pathway-space isn’t involved in any other), then I think we’d be able to understand how the network decomposes its function simply by adding different spline-code-directions together. But I strongly expect that spline-code-directions would interact with each other, in which case straightforward addition probably won’t always work. I’m not yet sure how best to get around this problem.
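As a rough illustration of what ‘directions in spline code space’ could look like, here is a sketch that runs PCA (via SVD) on the spline codes of a toy ReLU MLP. PCA is only a stand-in I’m assuming for illustration, since I haven’t settled on a method; the network, the helper `spline_codes`, and the `overlap` measure are likewise hypothetical.

```python
import numpy as np

def spline_codes(x, weights, biases):
    """Binary on/off pattern of every hidden ReLU, concatenated across layers."""
    codes, h = [], x
    for W, b in zip(weights, biases):
        pre = h @ W + b
        codes.append(pre > 0)
        h = np.maximum(pre, 0.0)
    return np.concatenate(codes, axis=1).astype(float)

# Hypothetical toy network: 2 inputs, two hidden layers of 8 ReLUs each.
rng = np.random.default_rng(0)
sizes = [2, 8, 8]
weights = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=n) for n in sizes[1:]]

x = rng.uniform(-1, 1, size=(500, 2))
codes = spline_codes(x, weights, biases)

# PCA on the codes: each row of `directions` is a unit vector over all 16 neurons,
# so it spans both layers (a crude 'direction in pathway-space').
centred = codes - codes.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
directions = vt[:3]

# Crude interaction check: off-diagonal entries are positive whenever two
# directions put weight on the same neuron.
abs_dirs = np.abs(directions)
overlap = abs_dirs @ abs_dirs.T
```

Directions found this way are orthogonal as vectors but generally dense, so they share neurons and the off-diagonal entries of `overlap` are typically nonzero. That is the kind of interaction that would stop simple addition of directions from giving a clean decomposition.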