Is the idea with the cosine similarity to check whether similar prompt topics consistently end up yielding similar vectors in the embedding space across all the layers, and different topics end up in different parts of embedding space?
Yeah, I would say this is the main idea I was trying to get towards.
If that’s the idea, have you considered just logging which attention heads and MLP layers have notably high or notably low activations for different vs. similar topics instead?
I think I will probably just look at the activations instead of the output + residual in further analysis, since things weren’t particularly clear in the outputs of the fully-connected layer, or at least find a better metric than cosine similarity. Cosine similarity probably won’t be too useful for much deeper analysis, but I think it was sort of useful for showing some trends.
I have also tried using a “scaled cosine similarity” metric, which shows essentially the same output but preserves the relative lengths. (That is, instead of normalising each vector to length 1, I rescaled every vector by the length of the largest vector, so that the largest vector has length 1 and every other vector is smaller or equal in size.)
With this metric, I think the graphs were slightly better, but in the similarity plots between different vectors, every vector ended up looking more similar to the longest vector. I thought this made it harder to see the similarity between small vectors on the graphs, and it felt like adding some weird new metric would be more confusing. (Though writing this now, it seems an obvious mistake: I should have just written the post with “scaled cosine similarity”, or possibly some better metric if I could find one, since it seems important here that two basically-zero vectors should have a very high similarity, and this isn’t captured by either of these metrics.) I might edit the post to add some extra graphs in an appendix, though this might also go into a separate post.
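To make the difference between the two metrics concrete, here is a minimal NumPy sketch of how they could be computed. The function names and the `eps` guard are my own assumptions for illustration, not the code used for the plots; the scaling rule itself follows the description above (divide every vector by the norm of the longest one).

```python
import numpy as np

def cosine_similarity_matrix(vectors, eps=1e-12):
    """Standard cosine similarity: every row is normalised to length 1."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.maximum(norms, eps)  # eps avoids division by zero
    return unit @ unit.T

def scaled_cosine_similarity_matrix(vectors, eps=1e-12):
    """'Scaled' variant: every row is divided by the norm of the LONGEST row,
    so relative lengths are preserved and the longest row has length 1."""
    vectors = np.asarray(vectors, dtype=float)
    max_norm = np.linalg.norm(vectors, axis=1).max()
    scaled = vectors / max(max_norm, eps)
    return scaled @ scaled.T

# Toy example: a long vector, a short parallel vector, and a zero vector.
toy = np.array([[3.0, 4.0], [0.3, 0.4], [0.0, 0.0]])
plain = cosine_similarity_matrix(toy)
scaled = scaled_cosine_similarity_matrix(toy)
```

In this toy example the short vector is perfectly similar to the long one under plain cosine similarity, but only weakly similar to itself under the scaled version, and the zero vector scores 0 with itself under both. That is the "two basically-zero vectors" problem mentioned above.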
As for looking at the attention heads instead of the attention blocks, so far I haven’t seen that they are a particularly better unit for distinguishing between the different categories of text (though for this analysis I have so far only looked at OPT-125M). When looking at the outputs of the attention heads and their cosine similarities, it usually seemed that the main difference came from a specific dimension being particularly bright, rather than from attention heads lighting up for specific categories. The magnitude of the activations also seemed pretty consistent between attention heads in the same layer (and was very small for most of the middle layers), except for the occasional high-magnitude dimension in the layers near the beginning and end.
I made some graphs that sort of show this. The indices 0-99 are the same as in the post.
Here are some results for attention head 5 from the attention block in the final decoder layer for OPT-125M:
The left image is the “scaled cosine similarity” between the (small) vectors (of size 64) put out by each attention head. The second image is the raw/unscaled values of the same output vectors, where each column represents an output vector.
Here are the same two plots, but instead for attention head 11 in the attention block of the final layer for OPT-125M:
I still think there might be some interesting things in the individual attention heads (most likely in the key-query behaviour, from what I have seen so far), but I will need to spend some more time doing analysis.
But if your hypothesis is specifically that there are different modules in the network for dealing with different kinds of prompt topics, that seems directly testable just by checking if some sections of the network “light up” or go dark in response to different prompts. Like a human brain in an MRI reacting to visual vs. auditory data.
This is the analogy I have had in my head when trying to do this, but I think my methodology has not tracked it as well as I would have preferred. In particular, I still struggle to understand how residual streams can form notions of modularity in networks.
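For what it’s worth, the “light up or go dark” check could be sketched roughly as below: split each attention output into per-head chunks, take the norm of each chunk, and average per prompt category. This is only a sketch under assumptions. The function name `mean_head_norms_by_category` is hypothetical, and it assumes the input is the concatenated per-head output (before the output projection mixes the heads), with one row per prompt.

```python
import numpy as np

def mean_head_norms_by_category(attn_out, labels, n_heads):
    """Average per-head activation norm for each prompt category.

    attn_out: (n_prompts, n_heads * head_dim) concatenated head outputs (assumed layout)
    labels:   (n_prompts,) category label for each prompt
    Returns a dict mapping category -> array of shape (n_heads,).
    """
    attn_out = np.asarray(attn_out, dtype=float)
    labels = np.asarray(labels)
    n_prompts, d_model = attn_out.shape
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    # Split the last axis into (head, head_dim) and take one norm per head.
    per_head = attn_out.reshape(n_prompts, n_heads, d_model // n_heads)
    head_norms = np.linalg.norm(per_head, axis=-1)  # (n_prompts, n_heads)
    return {
        cat: head_norms[labels == cat].mean(axis=0)
        for cat in np.unique(labels).tolist()
    }

# Synthetic example: 3 heads of size 2; category "a" only excites head 0,
# category "b" only excites head 2.
fake_out = np.array([
    [3.0, 4.0, 0.0, 0.0, 0.0, 0.0],
    [3.0, 4.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 6.0, 8.0],
    [0.0, 0.0, 0.0, 0.0, 6.0, 8.0],
])
fake_labels = np.array(["a", "a", "b", "b"])
result = mean_head_norms_by_category(fake_out, fake_labels, n_heads=3)
```

If some heads were specialised by topic, the per-category averages would differ sharply between heads, as in the synthetic data here. From what I described above, the real OPT-125M outputs looked much flatter than this across heads within a layer.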