It would also be useful to have the plotting code, if that’s easy to share?
Sure! I’ve just pushed the plot helper routines we used, as well as some examples.
I agree that N (true feature dimension) > d (observed dimension) and that sparsity will be high, but I’m uncertain whether the other part of the regime (which you don’t mention here), that k (model latent dimension) > N, is likely to hold. Do you think that is likely to be the case? As an analogy, the intermediate feature dimensions in MLP layers in transformers (analogously k) are much lower-dimensional than the “true intrinsic dimension of features in natural language” (analogously N), even if they are larger than the input dimension (embedding dimension × num_tokens, analogously d). So I expect N > k > d.
This is a great question. I think my expectation is that the number of features exceeds the number of neurons in real-world settings, but that it might be possible to arrange for the number of neurons to exceed the number of important features (at least if we use some sparsity/gating methods to get many more neurons without many more flops).
If we can’t get into that limit, though, it does seem important to know what happens when k < N, and we looked briefly at that limit in section 4.1.4. There we found that models tend to learn some features monosemantically and others polysemantically (rather than e.g. ignoring all but k features and learning those monosemantically), both for uniform and varying feature frequencies.
This is definitely something we want to look into more though, in particular in case of power-law (or otherwise varying) feature frequencies/importances. You might well expect that features just get ignored below some threshold and monosemantically represented above it, or it could be that you just always get a polysemantic morass in that limit. We haven’t really pinned this down, and it may also depend on the initialization/training procedure (as we saw when k > N), so it’s definitely worth a thorough look.
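(For concreteness, by power-law frequencies I mean something like the toy generator sketched below; the dimensions, exponent, and sampling details are illustrative placeholders rather than our exact setup.)

```python
import numpy as np

rng = np.random.default_rng(0)
N, batch, alpha = 512, 1024, 1.5     # illustrative feature count, batch size, exponent

# Feature i is active with probability proportional to (i + 1)^(-alpha):
# a handful of features are common and the long tail is increasingly rare.
freqs = np.arange(1, N + 1, dtype=float) ** -alpha   # freqs[0] = 1, so the most
                                                     # common feature fires almost always
active = rng.random((batch, N)) < freqs              # sparse on/off pattern per sample
x = active * rng.random((batch, N))                  # feature magnitudes when active
```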
In the paper you say that you weakly believe that monosemantic and polysemantic network parametrisations are likely in different loss basins, given they’re implementing very different algorithms. I think (given the size of your networks) it should be easy to test for at least linear mode connectivity with something like git re-basin (https://github.com/samuela/git-re-basin).
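(Concretely, the test I have in mind is: permute one model’s hidden units to match the other’s, then evaluate the loss along the straight line between the two weight settings and look for a barrier. A rough sketch of the second step, with the model and data left as placeholders and assuming all state-dict entries are float tensors, as they are for simple MLPs:)

```python
import copy
import torch

def loss_along_interpolation(model_a, model_b, loss_fn, data, n_points=11):
    """Evaluate the loss on a straight line in weight space between two models.
    Losses staying near the endpoint values suggest linear mode connectivity;
    a large bump in the middle is a loss barrier."""
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    probe = copy.deepcopy(model_a)
    x, y = data
    losses = []
    for t in torch.linspace(0.0, 1.0, n_points):
        interp = {k: (1 - t) * sd_a[k] + t * sd_b[k] for k in sd_a}
        probe.load_state_dict(interp)
        with torch.no_grad():
            losses.append(loss_fn(probe(x), y).item())
    return losses
```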
We haven’t tried this. It’s something we looked into briefly but were a little concerned about going down a rabbit hole given some of the discussion around whether the results replicated, which indicated some sensitivity to optimizer and learning rate.
I think (at least in our case) there might be a simpler way to get at this question: the first thing I’d do to understand connectivity is ask “how much regularization do I need to move from one basin to the other?” So, for instance, suppose we regularized the weights to push them directly from one basin towards the other: how much regularization would we need to make the models actually hop?
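(Sketching what I mean: add a quadratic pull towards the other basin’s weights and sweep the coefficient until the model hops. The training-step plumbing below is a placeholder, and the quadratic penalty is just one choice of regularizer.)

```python
import torch

def train_with_basin_pull(model, target_state, train_step, lam, n_steps=1000, lr=1e-3):
    """Ordinary training plus lam * ||theta - theta_target||^2 pulling the weights
    towards the other basin's solution; the question is how large lam must be
    before the model actually hops."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        task_loss = train_step(model)   # placeholder: returns the task loss on a batch
        pull = sum(((p - target_state[name]) ** 2).sum()
                   for name, p in model.named_parameters())
        (task_loss + lam * pull).backward()
        opt.step()
    return model
```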
Actually, one related reason we think that these basins are unlikely to be closely connected is that we see the monosemanticity “converge” towards the end of long training runs, rather than e.g. drifting as the model moves along a ridge. We don’t see this convergence everywhere, and in particular in high-k power-law models we see continuing evolution after many steps, but we think we understand that as a refinement of a particular minimum to better capture infrequent features.
You also mentioned that your initial attempts at sparsity through a hard-coded, initially sparse matrix failed; I’d be very curious to see whether lottery-ticket-style iterative magnitude pruning could take the high-latent-dimension monosemantic networks and produce sparse matrices that are still monosemantic, or more broadly how the LTH interacts with polysemanticity: are lottery tickets less polysemantic, more polysemantic, or do they not really change the monosemanticity?
Good question! We haven’t tried that precise experiment, but have tried something quite similar. Specifically, we’ve got some preliminary results from a prune-and-grow strategy (holding sparsity fixed, pruning smallest-magnitude weights, enabling non-sparse weights) that does much better than a fixed sparsity strategy.
I’m not quite sure how to interpret these results in terms of the lottery ticket hypothesis though. What evidence would you find useful to test it?
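(To make the prune-and-grow recipe concrete, one step looks roughly like the sketch below; the random grow criterion and the function names are placeholders rather than our actual implementation.)

```python
import torch

def prune_and_grow_step(weight, mask, swap_frac=0.1):
    """One prune-and-grow step at fixed sparsity: drop the smallest-magnitude
    surviving weights and enable an equal number of currently-masked positions,
    so the overall sparsity level is unchanged."""
    with torch.no_grad():
        flat_w, flat_m = weight.view(-1), mask.view(-1)
        n_swap = max(1, int(swap_frac * int(flat_m.sum().item())))

        # Candidates to grow: positions that are currently masked out.
        inactive = (flat_m == 0).nonzero().flatten()
        n_swap = min(n_swap, inactive.numel())

        # Prune: zero out the n_swap smallest-magnitude active weights.
        active_mag = torch.where(flat_m.bool(), flat_w.abs(),
                                 torch.full_like(flat_w, float("inf")))
        prune_idx = torch.topk(active_mag, n_swap, largest=False).indices
        flat_m[prune_idx] = 0.0

        # Grow: enable n_swap previously-masked weights, starting them at zero
        # (chosen at random here; a gradient-based criterion is another option).
        grow_idx = inactive[torch.randperm(inactive.numel())[:n_swap]]
        flat_m[grow_idx] = 1.0
        flat_w[grow_idx] = 0.0

        flat_w.mul_(flat_m)   # keep pruned weights exactly zero
    return weight, mask
```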
You might well expect that features just get ignored below some threshold and monosemantically represented above it, or it could be that you just always get a polysemantic morass in that limit
I guess the recent work on Polysemanticity and Capacity seems to suggest the latter case, especially in sparser settings, given the zone where multiple features are represented polysemantically, although I can’t remember whether they investigate power-law feature frequencies or just uniform ones.
were a little concerned about going down a rabbit hole given some of the discussion around whether the results replicated, which indicated some sensitivity to optimizer and learning rate.
My impression is that that discussion was more about whether the empirical results (i.e. do ResNets have linear mode connectivity?) held up, rather than whether the methodology implemented in the code base could be used to check whether linear mode connectivity holds between two models (up to permutation) for a given dataset. I imagine you could adapt the code to check for LMC between two trained models pretty quickly (it’s something I’m considering trying as well, hence the code requests).
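(For a single hidden layer, the weight-matching part reduces to roughly the sketch below; the actual repo does this jointly across all layers of a real architecture, so this is just the idea, not their code. Once B’s units are permuted you’d interpolate between the two models and check for a loss barrier.)

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_hidden_units(W_in_a, W_in_b, W_out_a, W_out_b):
    """Find the permutation of model B's hidden units that best matches model A.

    W_in_*:  (hidden, input) weights into the hidden layer
    W_out_*: (output, hidden) weights out of the hidden layer
    """
    # Similarity between A's unit i and B's unit j, summed over in- and out-weights.
    cost = W_in_a @ W_in_b.T + W_out_a.T @ W_out_b
    row, col = linear_sum_assignment(cost, maximize=True)
    perm = col[np.argsort(row)]            # perm[i] = B-unit matched to A-unit i
    return W_in_b[perm], W_out_b[:, perm]  # model B with its hidden units reordered
```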
I think (at least in our case) there might be a simpler way to get at this question: the first thing I’d do to understand connectivity is ask “how much regularization do I need to move from one basin to the other?” So, for instance, suppose we regularized the weights to push them directly from one basin towards the other: how much regularization would we need to make the models actually hop?
That would definitely be interesting to see. I guess this is kind of presupposing that the models are in different basins (which I also believe, but which hasn’t yet been verified). I also think looking at basins and connectivity would be more interesting in a case with more noise, whether from initialisation, inherent in the data, or from a much lower batch size making SGD noisy. In that case it’s less likely that the same configuration lands in the same basin, but if your interventions are robust to these kinds of noise then it’s a good sign.
Good question! We haven’t tried that precise experiment, but have tried something quite similar. Specifically, we’ve got some preliminary results from a prune-and-grow strategy (holding sparsity fixed, pruning smallest-magnitude weights, enabling non-sparse weights) that does much better than a fixed sparsity strategy.
I’m not quite sure how to interpret these results in terms of the lottery ticket hypothesis though. What evidence would you find useful to test it?
That’s cool, looking forward to seeing more detail. I think these results don’t seem that related to the LTH (if I understand your explanation correctly), as the LTH involves finding sparse subnetworks within dense ones. Possibly it only actually holds in models with many more parameters; I haven’t seen it investigated in models that aren’t overparametrised in a classical sense.
I think if iterative magnitude pruning (IMP) on these problems produced much sparser subnetworks that also maintained their monosemanticity levels, that would suggest that sparsity doesn’t penalise monosemanticity (or polysemanticity) in this toy model, and also (much more speculatively) that the sparse, well-performing subnetworks that IMP finds in other networks may also maintain their levels of poly/mono-semanticity. If we also think those networks are favoured towards poly or mono, then that hints at whether the overall learning process is favoured towards poly or mono.
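(By IMP I mean the standard lottery-ticket recipe, roughly the sketch below; train_fn is a placeholder that trains while keeping masked weights at zero, and I haven’t actually run this on your models.)

```python
import copy
import torch

def iterative_magnitude_pruning(model, train_fn, prune_frac=0.2, rounds=5):
    """Train, globally prune the smallest prune_frac of the surviving weights,
    rewind the survivors to their initial values, and repeat. One could then
    measure how monosemantic each successively sparser subnetwork is."""
    init_state = copy.deepcopy(model.state_dict())
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}

    for _ in range(rounds):
        train_fn(model, masks)   # placeholder: trains with the masks applied

        with torch.no_grad():
            # Global magnitude threshold over currently surviving weights.
            surviving = torch.cat([p[masks[n].bool()].abs().flatten()
                                   for n, p in model.named_parameters()])
            k = max(1, int(prune_frac * surviving.numel()))
            threshold = torch.kthvalue(surviving, k).values

            for n, p in model.named_parameters():
                masks[n] *= (p.abs() > threshold).float()

            # Rewind surviving weights to their values at initialization.
            model.load_state_dict(init_state)
            for n, p in model.named_parameters():
                p.mul_(masks[n])
    return model, masks
```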
Thanks for these thoughts!