What I want are some human-meaningful features that can get combined in human-meaningful ways.
E.g. you take a photo of a duck, you take a feature that means “this photo was taken on a sunny day,” and then you do some operation to smush these together and you get a photo of a duck taken on a sunny day.
If features are vectors of fixed direction with size drawn from a distribution, which is my takeaway from the superposition paper, then the smushing-together operation is addition (maybe conditional on the dot product of the current image with the feature being above some threshold).
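For concreteness, here's a minimal sketch of what that smushing operation might look like under the fixed-direction reading. Everything here is illustrative: the activations are random stand-ins, and the threshold and margin values are arbitrary choices, not anything from the superposition paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512                          # assumed dimensionality of the activation space

act = rng.normal(size=d)         # stand-in for the duck photo's activations
sunny = rng.normal(size=d)
sunny /= np.linalg.norm(sunny)   # fixed direction for the "sunny day" feature

def add_feature(act, direction, threshold=1.0, margin=0.5):
    """If the activation's projection onto the feature direction is below
    `threshold`, add just enough of the direction to push it above
    (plus a small margin), so the feature reads as 'present'."""
    proj = act @ direction
    if proj < threshold:
        act = act + (threshold + margin - proj) * direction
    return act

sunnier_duck = add_feature(act, sunny)
```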
If the on-distribution data points get mapped to regions of the activation space with lots of large polytopes, how does this help us extract some commonality from a bunch of photos of sunny days, and then smush that commonality together with a photo of a duck to get a photo of a duck on a sunny day?
Not a rhetorical question; I’m trying to think through it. It’s just hard.
Maybe you’d think of the commonality between the sunny-day pictures in terms of their codes? You’d toss out the linear part and just say that what the sunny-day pictures have in common is that they have some subset of the nonlinearities that all tend to be in the same state. And so you could make the duck picture more sunny by flipping that subset of neurons to be closer to the sunny-day mask.
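A toy version of that mask-flipping idea, for a single ReLU layer, might look like the following. The agreement cutoffs (0.95/0.05) and the size of the nudge are arbitrary choices, and the preactivations are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in preactivations for one ReLU layer: rows are sunny-day images.
preacts_sunny = rng.normal(size=(100, 256))
preacts_duck = rng.normal(size=256)

# Fraction of sunny images in which each ReLU is on.
on_rate = (preacts_sunny > 0).mean(axis=0)

# Neurons whose on/off state is (nearly) constant across sunny images
# form the shared "sunny-day mask".
shared_on = on_rate > 0.95
shared_off = on_rate < 0.05

# Flip the duck's code toward the mask by nudging those preactivations
# across their ReLU boundary.
eps = 0.1
sunnier = preacts_duck.copy()
sunnier[shared_on] = np.maximum(sunnier[shared_on], eps)
sunnier[shared_off] = np.minimum(sunnier[shared_off], -eps)
```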
This is one of the major research questions that will be important to answer before polytopes can be really useful in mechanistic descriptions.
By choosing clustering rather than dimensionality reduction methods, we took a non-decompositional approach here. Clustering was motivated primarily by wanting to capture the monosemanticity of local regions in neural networks. But the ‘monosemanticity’ I’m talking about here refers to the fact that small regions of activation space mean one thing on one level of abstraction; that ‘one thing’ could be a combination of features. So this isn’t to say that small regions of activation space represent only one feature on a lower level of abstraction. A small region of activation space (e.g. a group of nearby polytopes) might exhibit multiple features at a particular level of abstraction, and clustering isn’t going to help us break that level of abstraction apart into its composite features.
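To make that limitation concrete: a clustering assigns each activation a single label, and that label is the whole story at its level of abstraction. The sketch below uses k-means purely as a stand-in for whatever clustering method is actually used, on random stand-in activations.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 256))   # stand-in activations for a mix of images

labels = KMeans(n_clusters=50, n_init=10, random_state=0).fit_predict(acts)

# Each cluster label is one symbol at one level of abstraction: the images
# in some cluster might all be "ducks on sunny days", but the label alone
# gives no handle on "duck" and "sunny" as separate factors.
```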
Instead of clustering, it seems like it should be possible to find directions in spline-code space, rather than directions in activation space. Spline codes can incorporate information about the pathway taken by activations through multiple layers, which means that spline-code directions roughly correspond to ‘directions in pathway-space’. If directions in pathway-space don’t interact with each other (i.e. a neuron that’s involved in one direction in pathway-space isn’t involved in any other), then I think we’d be able to understand how the network decomposes its function simply by adding different spline-code directions together. But I strongly expect that spline-code directions would interact with each other, in which case straightforward addition probably won’t always work. I’m not yet sure how best to get around this problem.
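As a sketch of what a ‘direction in spline-code space’ might mean, here is a crude difference-of-means version, assuming binary codes recorded from the ReLU on/off states along the forward pass. The data and the 0.5 re-binarisation cutoff are stand-ins, not a proposal from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in spline codes: concatenated ReLU on/off states across layers,
# one row per image. In practice these would be recorded during the
# forward pass.
codes_sunny = (rng.random((200, 1024)) < 0.55).astype(float)
codes_other = (rng.random((200, 1024)) < 0.45).astype(float)

# A crude spline-code direction: the difference of mean codes. It points
# from a typical non-sunny pathway toward a typical sunny pathway.
direction = codes_sunny.mean(axis=0) - codes_other.mean(axis=0)

# Naive addition of the direction to one image's code, then re-binarising.
# If spline-code directions interact, this is exactly the step that breaks.
alpha = 1.0
new_code = (codes_other[0] + alpha * direction) > 0.5
```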
How would one use this to inform decomposition?