Interesting results, thanks for sharing! To clarify, what exactly are you doing after identifying a direction vector? Projecting and setting its coordinate to zero? Actively reversing it?
And how do these results compare to the dumb approach of just taking the gradient of the logit difference at that layer, and using that as your direction?
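For concreteness, something like this (a rough sketch, not a tested recipe; the prompt, layer index, and pronoun tokens are arbitrary placeholder choices):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tok("My friend the doctor said that", return_tensors="pt").input_ids
out = model(ids, output_hidden_states=True)
h = out.hidden_states[6]          # residual stream at one mid layer (arbitrary choice)
h.retain_grad()                   # keep the gradient on this intermediate tensor

he_id = tok(" he").input_ids[0]
she_id = tok(" she").input_ids[0]
logit_diff = out.logits[0, -1, he_id] - out.logits[0, -1, she_id]
logit_diff.backward()

direction = h.grad[0, -1]         # gradient at the final token position
direction = direction / direction.norm()
```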
Some ad-hoc hypotheses for what might be going on:
An underlying thing is probably that the model is representing several correlated features—is_woman, is_wearing_a_dress, has_long_hair, etc. Even if you can properly isolate the is_woman direction, just deleting this may not matter that much, esp if the answer is obvious?
I’m not sure which method this will harm more though, since presumably they’ll pick up on a direction that’s the average of all features weighted by their correlation with is_woman, ish.
IMO, a better metric here is the difference in the she vs he logits, rather than just the accuracy—that may be a better way of picking up on whether you’ve found a meaningful direction?
GPT-2 is trained with dropout on the residual stream, which may fuck with things? It’s presumably learned at least some redundancy to cope with ablations.
To test this, try replicating on GPT-Neo 125M? That doesn’t have dropout.
Gender is probably just a pretty obvious thing that’s fairly overdetermined, and breaking an overdetermined thing by removing a particular concept is hard.
I have a fuzzy intuition that it’s easier to break models than to insert/edit a concept? I’d weakly predict that your second method will damage model performance in general more, even on non-gender-y tasks. Maybe it’s learned to shove the model off distribution in a certain way, or exploits some feature of how the model is doing superposition where it adds in two features the model doesn’t expect to see at the same time, which fucks with things? Idk, these are all random ad-hoc guesses.
Another way of phrasing this—I’d weakly predict that for any set of, say, 10 prompts, even with no real connection, you could learn a direction that fucks with all of them, just because it’s not that hard to break a model with access to its internal activations (if this is false, I’d love to know!). Clearly, that direction isn’t going to have any conceptual meaning!
One concrete experiment to test the above—what happens if you reverse the direction, rather than removing it? (Ie, take v − 2(v . gender_dir) gender_dir, not v - (v . gender_dir) gender_dir). If RLACE can significantly reverse accuracy to favour she over he incorrectly, that feels interesting to me!
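I.e., something like this (sketch, assuming a unit-norm gender_dir and activations as plain vectors):

```python
import numpy as np

def remove_direction(v: np.ndarray, gender_dir: np.ndarray) -> np.ndarray:
    """Project out the component of v along the (unit-norm) gender direction."""
    return v - (v @ gender_dir) * gender_dir

def reverse_direction(v: np.ndarray, gender_dir: np.ndarray) -> np.ndarray:
    """Flip the component of v along the gender direction (a reflection)."""
    return v - 2 * (v @ gender_dir) * gender_dir
```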
Actually, RLACE has a lesser impact on the representation space, since it removes just a rank-1 subspace.
Note that if we train a linear classifier w to convergence (as done in the first iteration of INLP), then by definition we can project the entire representation space onto the direction w and retain the very same accuracy, because the subspace spanned by w is the only thing the linear classifier is “looking at”. We performed experiments similar in spirit to what you suggest with INLP in [this](https://arxiv.org/pdf/2105.06965.pdf) paper. The attached image shows the effect of a positive/negative intervention across layers.
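To make that first point concrete, a toy sketch (synthetic data standing in for the layer activations and labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for layer activations X and binary labels y.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))
y = (X[:, 0] + 0.1 * rng.normal(size=1000) > 0).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X, y)
w = clf.coef_[0] / np.linalg.norm(clf.coef_[0])   # unit-norm classifier direction

X_onto = (X @ w)[:, None] * w      # keep only the span(w) component
X_removed = X - X_onto             # nullspace projection (one INLP step)

print(clf.score(X_onto, y))        # same accuracy as on X: w is all the probe sees
print(clf.score(X_removed, y))     # roughly chance
```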
In the experiments I ran with GPT-2, RLACE and INLP are both used with a rank-1 projection. So RLACE could have “more impact” if it removed a more important direction, which I think it does.
I know it’s not the intended use of INLP, but I got my inspiration from this technique, and that’s why I write “INLP (Ravfogel, 2020)” (the original technique removes multiple directions to obtain a measurable effect).
[Edit] Tell me if you prefer that I avoid calling the “linear classifier method” INLP (it isn’t actually iterated in the experiments I ran, but it is where I discovered the idea of using a linear classifier to project data to remove information)!
Thank you for the feedback! I ran some of the experiments you suggested and added them to the appendix of the post.
While I was running some of the experiments, I realized I had made a big mistake in my analysis: in fact, the direction which matters the most (found by RLACE) is the one with large changes (and not the one with crisp changes)! (I’ve edited the post to correct that mistake.)
What I’m actually doing is an affine projection: v ← ((v − m) − ((v − m) . d) d) + m, where v is the activation, d is the direction (normalized), and m is “the median in the direction of d”: m = median_v(d . v) d.
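In code, for a batch of activations, this looks roughly like (a minimal sketch):

```python
import numpy as np

def affine_project(V: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Remove the component along the unit-norm direction d, centering at the
    median coordinate of the activations along d instead of at zero."""
    coords = V @ d                      # coordinate of each activation along d
    m = np.median(coords) * d           # the median in the direction of d
    centered = V - m
    return centered - np.outer(centered @ d, d) + m
```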
Looking at the gradient might be a good idea; I haven’t tried it.
About your hypotheses:
Definitely something like that is going on, though I don’t think I capture most of the highly correlated features you might want to catch, since the text I use to find the direction is very basic.
You might be interested in two different kinds of metrics:
Is your classifier doing well on the activations? (This is the accuracy I report; I chose accuracy since it is easier to interpret than the loss of a linear classifier.)
Is your model actually outputting he in sentences about men and she in sentences about women, or is it confused about gender in general? I did measure something like the logit difference of he vs she (I actually measured probability ratios relative to the bigger probability, to avoid giving too much weight to outliers), and found a “bias” (on the training data) of 0.87 (no bias is 0, max is 1) before the edit, 0.73 after the edit with RLACE, and 0.82 after the edit with INLP. (I can give more detail about the metric if someone is interested.)
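For reference, a simplified sketch of a metric in this spirit (not the exact code, and the precise normalization may differ):

```python
import numpy as np

def gender_bias_score(p_he, p_she, is_male):
    """Signed gap between the two pronoun probabilities, normalized by the larger
    one, averaged over prompts. +1: always the matching pronoun, 0: no preference,
    -1: always the opposite pronoun."""
    p_match = np.where(is_male, p_he, p_she)
    p_other = np.where(is_male, p_she, p_he)
    return float(np.mean((p_match - p_other) / np.maximum(p_match, p_other)))
```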
Dropout doesn’t seem to be the source of the effect: I ran the experiment with GPT-Neo-125M and found qualitatively similar results (see appendix).
Yes, gender might be hard. I’m open to suggestions for better concepts! Most concepts are not as crisp as gender, which might make things harder. Indeed, the technique requires you to provide “positive” and “negative” sentences, ideally pairs of sentences which differ only by the target concept.
Breaking the model is one of the big things this technique does. But it is interesting if you are able to break the model “more” on gender-related subjects, and it looks like this is happening (generations diverge more slowly when the text is not about gender). One observation providing evidence for “you’re mostly breaking the model”: in experiments where the concept is political left vs political right (see notebook in appendix), the model edited for gender produced weird results.
Great idea, swapping works remarkably well!
Eyeballing the completions, the “swap” works better than the projection without breaking the model more than the projection does (see appendix), and using the metric I described above, you get a bias of −0.29 (inverted bias) for the model edited with RLACE and 0.68 for the model edited with INLP.
You can also use the opposite idea to increase bias (multiply the importance of the direction by 2), and this somewhat works: you get a bias of 0.83 (down from 0.87) with RLACE, and 0.90 (up from 0.87) with INLP. So INLP did increase the bias. RLACE has probably broken too many things to be able to be more biased than “reality”.
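Schematically, removing, swapping, and amplifying are the same operation with a different scale applied to the component along d (a rough sketch, assuming the same median centering as above):

```python
import numpy as np

def scale_direction(V: np.ndarray, d: np.ndarray, alpha: float) -> np.ndarray:
    """Rescale the (median-centered) component of each activation along the
    unit-norm direction d: alpha=0 removes it, alpha=-1 swaps it, alpha=2 doubles it."""
    coords = V @ d
    m = np.median(coords)
    return V + (alpha - 1.0) * np.outer(coords - m, d)
```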
I think this is evidence for the fact that this technique is not just breaking the model.