Cool paper. I think the semantic similarity result is particularly interesting.
As I understand it, you’ve got a circuit that wants to calculate something like Sim(A,B), where A and B might each have many “senses” (aka features), but the overall Sim might not be a linear function of the per-sense Sims across all senses/features.
So for example, there are senses in which “Berkeley” and “California” are geographically related, and there might be a few other senses in which they are semantically related, but probably none that really matter for copy suppression. For this reason, I wouldn’t expect the cosine similarity between the two tokens’ embeddings to be predictive of the copy suppression score. That would only happen for really “mono-semantic” tokens that have only one sense (maybe you could test that).
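To make the worry concrete, here is a toy sketch of how full-embedding cosine similarity can diverge from similarity within a single sense. Everything in it is an assumption for illustration: the sense directions are hypothetical and exactly orthonormal, the embedding weights are made up, and I arbitrarily treat the geographic sense as the one the circuit cares about. None of it comes from the paper.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def project(x, basis):
    """Project x onto the span of the orthonormal rows of `basis`."""
    return basis.T @ (basis @ x)

# Hypothetical orthonormal "sense" directions in a toy 3-d embedding space.
geo, university, state = np.eye(3)

# Made-up embeddings: both tokens share the geographic sense, but each is
# dominated by a sense that the other lacks.
berkeley = 1.0 * geo + 2.0 * university
california = 1.0 * geo + 2.0 * state

# Similarity over all senses at once.
full_sim = cosine(berkeley, california)

# Similarity restricted to the single sense assumed to matter here.
geo_basis = geo[None, :]
geo_sim = cosine(project(berkeley, geo_basis), project(california, geo_basis))

print(f"full cosine similarity: {full_sim:.2f}")
print(f"similarity within the geographic sense: {geo_sim:.2f}")
```

In this toy setup the full cosine similarity is 0.2 while the within-sense similarity is 1.0, so the full cosine would badly underestimate a score that only reads the shared sense. For a token with a single sense the two numbers coincide, which is the “mono-semantic” case.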
Moreover, there are also tokens which you might want to ignore when doing copy suppression (speculatively), e.g. very common words or punctuation (the/and/etc.).
I’d be interested to know whether you have used something like SAEs to decompose the tokens (or the activations prior to the key/query projections) into the underlying features present at different intensities in each of them. Follow-up experiments could attempt to determine whether copy suppression could be better understood when the semantic subspaces are known. Some things that might be cool here:
- Show that some features are mapped to the null space of keys/queries in copy suppression heads, indicating semantic senses/features that are ignored by copy suppression. Maybe multiple anti-induction heads compose (within or between layers) so that if one maps a feature to the null space, another doesn’t (or some linear combination does), or a more complicated function of sets of features is used to inform suppression.
- Similarly, show that the OV circuit is suppressing the same feature/features you think are being used to determine semantic similarity. If there’s some asymmetry here, that could be interesting, as it would correspond to “I calculate A and B as similar by their similarity in the *California axis* but I suppress predictions of any token that has the feature for *anywhere on the West Coast*”.
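Both checks suggested above could be sketched roughly as follows. This is a toy under stated assumptions, not anything from the paper: the weights are random stand-ins for a real head, the feature directions are hypothetical unit vectors (in practice they might come from SAE decoder rows), the names `remove_direction` and `read_strength` are made up here, and I assume the row-vector convention where keys are `x @ W_K` with `W_K` of shape `(d_model, d_head)`.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 8, 2

# Hypothetical unit-norm feature directions (stand-ins for SAE features).
features = {
    "california": np.eye(d_model)[0],
    "west_coast": np.eye(d_model)[1],
    "common_word": np.eye(d_model)[2],
}

def remove_direction(W, f):
    """Return a copy of W for which the unit input direction f maps to zero."""
    return W - np.outer(f, f @ W)

def read_strength(W, f):
    """Norm of the image of f under W, relative to the norm of f."""
    return float(np.linalg.norm(f @ W) / np.linalg.norm(f))

# Toy head: by construction its keys ignore the "common_word" feature.
W_K = remove_direction(rng.normal(size=(d_model, d_head)), features["common_word"])
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))
W_OV = W_V @ W_O

# Check 1: which features sit (numerically) in the null space of the keys?
key_read = {name: read_strength(W_K, f) for name, f in features.items()}

# Check 2: how strongly does the OV circuit act on each feature, and does it
# push the residual stream along (+) or against (-) that same direction?
ov_write = {name: read_strength(W_OV, f) for name, f in features.items()}
ov_self = {name: float(f @ W_OV @ f) for name, f in features.items()}

for name in features:
    print(f"{name:12s} key_read={key_read[name]:.3f} "
          f"ov_write={ov_write[name]:.3f} ov_self={ov_self[name]:+.3f}")
```

A feature with `key_read` near zero is one the head cannot use when matching keys, and a mismatch between `key_read` and `ov_write` is the kind of asymmetry mentioned above. On a real model you would swap in the head’s actual weights, and probably account for LayerNorm and the unembedding, which this sketch leaves out.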
I’m particularly excited about this because it might represent a really good way to show how knowing features informs the quality of mechanistic explanations.