Is the idea with the cosine similarity to check whether similar prompt topics consistently end up yielding similar vectors in the embedding space across all the layers, and different topics end up in different parts of embedding space?
Because individual transformer layers are assumed to only act on specific sub-spaces of the embedding space, and write their results back into the residual stream, so if you can show that different topics end up in different sub-spaces of the stream, you effectively show that different attention heads and MLPs must be dealing with them, meaning totally different parts of the network are active for different prompts?
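(For concreteness, the kind of check I’m imagining looks roughly like the sketch below, using the per-layer hidden states that HuggingFace models expose. The model name, the prompts, and the choice of the last-token vector are placeholder assumptions rather than a claim about what you actually did.)

```python
# Rough sketch: compare the residual-stream vectors for two prompts, layer by
# layer, via cosine similarity. Everything concrete here (model, prompts,
# using the last token's hidden state) is an assumption for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m").eval()

def layerwise_vectors(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # out.hidden_states: (num_layers + 1) tensors of shape (1, seq_len, d_model),
    # i.e. the residual stream after the embedding and after each decoder layer.
    return [h[0, -1] for h in out.hidden_states]

a = layerwise_vectors("A prompt about one topic")
b = layerwise_vectors("A prompt about a very different topic")

for layer, (va, vb) in enumerate(zip(a, b)):
    sim = torch.nn.functional.cosine_similarity(va, vb, dim=0).item()
    print(f"layer {layer:2d}: cosine similarity = {sim:+.3f}")
```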
If that’s the idea, have you considered just logging which attention heads and MLP layers have notably high or notably low activations for different vs. similar topics instead?
This wouldn’t be an option if we were looking for modularity in general, because the network might still have different modules for dealing with different computational steps of processing the same sort of prompt.
But if your hypothesis is specifically that there are different modules in the network for dealing with different kinds of prompt topics, that seems directly testable just by checking if some sections of the network “light up” or go dark in response to different prompts. Like a human brain in an MRI reacting to visual vs. auditory data.
At first glance, this seems to me like it’d be more precise and reliable than trying to check whether different prompt topics land in different parts of the residual stream, and should also yield more research bits, because you actually get to immediately see what the modules are.
Or am I just misunderstanding what you’re trying to get at here?
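To make the “light up or go dark” idea concrete, the sort of logging I have in mind is something like the sketch below: forward hooks that record a per-prompt activation norm for every attention block and MLP output. The module paths follow the current HuggingFace OPT implementation and the prompts are placeholders, so the details are assumptions rather than a description of your setup.

```python
# Register forward hooks on each attention block and each MLP output layer of
# OPT-125M, and record the mean activation norm per prompt. Comparing these
# profiles across prompt topics is the "which parts light up" check.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m").eval()

activation_norms = {}

def make_hook(name):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        activation_norms[name] = hidden.norm(dim=-1).mean().item()
    return hook

for i, layer in enumerate(model.model.decoder.layers):
    layer.self_attn.register_forward_hook(make_hook(f"layer{i:02d}.attn"))
    layer.fc2.register_forward_hook(make_hook(f"layer{i:02d}.mlp"))

def activation_profile(prompt):
    activation_norms.clear()
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        model(**inputs)
    return dict(activation_norms)

profile_a = activation_profile("A prompt about one topic")
profile_b = activation_profile("A prompt about a very different topic")
for name in sorted(profile_a):
    print(f"{name:14s}  {profile_a[name]:8.2f}  {profile_b[name]:8.2f}")
```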
but one main issue is that analysis doesn’t generalise particularly well to neural networks that are not easy to graph (i.e. not Multi-Layer Perceptrons (MLPs)).
Quibble: I suspect that the measure may have a lot of problems for MLPs as well. Graph theory implicitly assumes that all nodes are equivalent. If you connect two nodes to a third node with positive weights, graph theory treats the connectedness from the two as simply adding up.

But in neural networks, two different neurons can have either entirely different or very similar features. If these features have big “parallel[1]” components with opposite sign, connecting to those neurons with positive weights can make the signals from both mostly cancel each other out[2].

[1] Parallel in function space, that is.

[2] I think the solution to this might be to look at the network layers in a basis where all features are orthogonal.
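A toy numerical illustration of the cancellation point (the numbers here are made up purely for the example):

```python
# Two neurons whose feature directions are mostly anti-parallel, both wired
# into a third node with positive weights. A graph metric just adds the edge
# weights, but the actual contributions largely cancel.
import numpy as np

feature_1 = np.array([1.0, 0.1])     # neuron 1's feature direction
feature_2 = np.array([-1.0, 0.15])   # mostly anti-parallel to feature_1

w1, w2 = 0.8, 0.7                    # both connection weights are positive
graph_view = w1 + w2                 # what "connectedness" looks like to graph theory
actual = w1 * feature_1 + w2 * feature_2  # what the downstream node actually receives

print("sum of edge weights:        ", graph_view)              # 1.5
print("norm of the combined signal:", np.linalg.norm(actual))  # ~0.2
```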
Is the idea with the cosine similarity to check whether similar prompt topics consistently end up yielding similar vectors in the embedding space across all the layers, and different topics end up in different parts of embedding space?
Yeah, I would say this is the main idea I was trying to get towards.
If that’s the idea, have you considered just logging which attention heads and MLP layers have notably high or notably low activations for different vs. similar topics instead?
I think I’ll probably just look at the activations instead of the output + residual in further analysis, since the pattern wasn’t particularly clear in the outputs of the fully-connected layer, or at least find a better metric than cosine similarity. Cosine similarity probably won’t be too useful for much deeper analysis, but I think it was sort of useful for showing some trends.
I have also tried using a “scaled cosine similarity” metric, which shows essentially the same output but preserves relative length (that is, instead of normalising each vector to 1, I rescaled each vector by the length of the largest vector, so that the largest vector has length 1 and every other vector is smaller than or equal to it in size).
With this metric, I think the graphs were slightly better, but in the similarity plots every vector tended to look most similar to the longest vector, which I thought made it harder to see the similarity for the small vectors on the graphs, and it felt like it would be more confusing to introduce some weird new metric. (Though writing this now, it seems like an obvious mistake not to have just written the post with “scaled cosine similarity”, or possibly some better metric if I could find one, since it seems important here that two basically-zero vectors should count as very similar, and this isn’t captured by either metric.) I might edit the post to add some extra graphs in an appendix, though this might also go into a separate post.
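For reference, here is roughly what I mean by the two metrics (a sketch rather than the exact code used for the post; the tiny example at the end also shows the near-zero-vector problem mentioned above):

```python
import numpy as np

def cosine_similarity_matrix(X):
    # Standard cosine similarity: every row of X is normalised to length 1.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)
    return Xn @ Xn.T

def scaled_cosine_similarity_matrix(X):
    # "Scaled cosine similarity": rescale every row by the length of the
    # longest row, so the longest vector has length 1 and shorter vectors
    # stay proportionally small, then take dot products.
    max_norm = np.linalg.norm(X, axis=1).max()
    Xs = X / max(max_norm, 1e-12)
    return Xs @ Xs.T

# Two basically-zero vectors and one large one. Plain cosine similarity only
# sees their (noisy) directions, and the scaled version reports ~0 for the
# near-zero pair -- neither metric says "these two are very similar".
X = np.array([[ 1e-6, -1e-6],
              [-1e-6,  1e-6],
              [ 3.0,   4.0 ]])
print(cosine_similarity_matrix(X).round(2))
print(scaled_cosine_similarity_matrix(X).round(2))
```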
As for looking at the attention heads instead of the attention blocks: so far I haven’t seen that they are a particularly better unit for distinguishing between the different categories of text (though for this analysis I have so far only looked at OPT-125M). When looking at the outputs of the attention heads and their cosine similarities, the main difference usually seemed to come from a specific dimension being particularly bright, rather than from attention heads lighting up for specific categories. The magnitude of the activations also seemed pretty consistent between attention heads in the same layer (and was very small for most of the middle layers), except for the occasional high-magnitude dimension in the layers near the beginning and end.
I made some graphs that sort of show this. The indices 0-99 are the same as in the post.
Here are some results for attention head 5 from the attention block in the final decoder layer of OPT-125M:
The left image is the “scaled cosine similarity” between the (small) vectors (of size 64) put out by the attention head; the right image is the raw/unscaled values of the same output vectors, where each column represents one output vector.
Here are the same two plots, but instead for attention head 11 in the attention block of the final layer for OPT-125M:
I still think there might be some interesting things in the individual attention heads (most likely in the key–query behaviour, from what I have seen so far), but I will need to spend some more time on the analysis.
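For anyone who wants to poke at the same thing, here is roughly how the per-head vectors (of size 64 for OPT-125M) can be pulled out: capture the input to each attention block’s out_proj, which is the concatenation of the head outputs before they are mixed back into the residual stream. The module paths follow the current HuggingFace OPT implementation, so treat the details as an assumption that may need adjusting.

```python
# Forward pre-hooks on out_proj capture the concatenated per-head outputs,
# which can then be reshaped into (seq_len, n_heads, head_dim).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m").eval()

n_heads, head_dim = 12, 64
head_outputs = {}  # layer index -> tensor of shape (seq_len, n_heads, head_dim)

def make_pre_hook(layer_idx):
    def pre_hook(module, inputs):
        hidden = inputs[0]  # (batch, seq_len, n_heads * head_dim)
        head_outputs[layer_idx] = hidden[0].reshape(-1, n_heads, head_dim)
    return pre_hook

layers = model.model.decoder.layers
for i, layer in enumerate(layers):
    layer.self_attn.out_proj.register_forward_pre_hook(make_pre_hook(i))

inputs = tokenizer("A prompt from one of the categories", return_tensors="pt")
with torch.no_grad():
    model(**inputs)

# e.g. last-token output of attention head 5 in the final decoder layer:
vec = head_outputs[len(layers) - 1][-1, 5]
print(vec.shape)  # torch.Size([64])
```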
But if your hypothesis is specifically that there are different modules in the network for dealing with different kinds of prompt topics, that seems directly testable just by checking if some sections of the network “light up” or go dark in response to different prompts. Like a human brain in an MRI reacting to visual vs. auditory data.
This is the analogy I have had in my head while trying to do this, but I think my methodology has not tracked it as well as I would have liked. In particular, I still struggle to understand how residual streams can form notions of modularity in networks.