Awesome work, and nice write-up!
One question that I had while reading the section on refusals:
Your method found two vectors (vectors 9 and 22) that seem to bypass refusal in the “real-world” setting.
While these vectors themselves are orthogonal (due to your imposed constraint), have you looked at the resulting downstream activation difference directions and checked if they are similar?
I.e. adding vector 9 at an early layer results in a downstream activation diff in the direction δ9, and adding vector 22 at an early layer results in a downstream activation diff in the direction δ22. Are these downstream activation diff directions δ9 and δ22 roughly the same? Or are they almost orthogonal?
(My prediction would be that they’re very similar.)
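To make the question concrete, here is roughly the computation I have in mind, as a minimal PyTorch sketch. The module path, hook behavior, and names (`steering_vec`, `source_layer`, `target_layer`) are my assumptions about a HuggingFace-style decoder, not the post's actual code:

```python
import torch

def downstream_diff(model, input_ids, steering_vec, source_layer, target_layer):
    """delta = (steered - unsteered) hidden states at target_layer, where
    steering_vec is added to the residual stream output of source_layer."""
    captured = {}

    def capture(module, inputs, output):
        # decoder layers typically return a tuple; hidden states are output[0]
        hidden = output[0] if isinstance(output, tuple) else output
        captured["acts"] = hidden.detach()

    def steer(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + steering_vec          # added at every token position
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden

    layers = model.model.layers                 # assumed module path (Llama/Qwen-style)

    # Unsteered forward pass
    h = layers[target_layer].register_forward_hook(capture)
    with torch.no_grad():
        model(input_ids)
    h.remove()
    baseline = captured["acts"]

    # Steered forward pass
    hooks = [layers[source_layer].register_forward_hook(steer),
             layers[target_layer].register_forward_hook(capture)]
    with torch.no_grad():
        model(input_ids)
    for h in hooks:
        h.remove()

    return captured["acts"] - baseline          # shape: (batch, seq_len, d_model)
```

δ9 and δ22 would then just be this quantity evaluated with vector 9 and vector 22 as `steering_vec`, on the same prompt.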
This is an interesting question!
I just checked this. The cosine similarity of δ9 and δ22 is .52, which is much more similar than you'd expect from random vectors of the same dimensionality (this computes the δ's across all token positions and then flattens them, which is how the objective was computed for the main refusal experiment in the post).
If you restrict to calculating δ’s at just the assistant tag at the end of the prompt, the cosine similarity between δ9 and δ22 goes up to .87.
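For concreteness, here is a sketch of the two comparisons described above, assuming `delta_9` and `delta_22` are (seq_len, d_model) diffs from the same prompt with the batch dimension squeezed out (names are mine, not from the post):

```python
import torch.nn.functional as F

# (a) flatten across all token positions, then take a single cosine similarity
cos_all_tokens = F.cosine_similarity(delta_9.flatten(), delta_22.flatten(), dim=0)

# (b) compare only at the final position (the assistant tag at the end of the prompt)
cos_assistant_tag = F.cosine_similarity(delta_9[-1], delta_22[-1], dim=0)
```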
Interestingly, the cosine similarities between the δ's seem to be somewhat high across all pairs of steering vectors (mean of .25 across pairs, which is higher than for random vectors, where it would be close to zero). This suggests it might be better to use some sort of soft orthogonality constraint over the δ's (penalizing pairwise cosine similarities) rather than a hard orthogonality constraint over the steering vectors, if you want better diversity across vectors. I'll have to try this at some point.
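Here is roughly what that soft constraint could look like, as a sketch of a penalty term; this is just my guess at an implementation (with a coefficient to be tuned), nothing here is from the post:

```python
import torch
import torch.nn.functional as F

def soft_orthogonality_penalty(deltas):
    """deltas: list of downstream-diff tensors, one per steering vector."""
    D = torch.stack([F.normalize(d.flatten(), dim=0) for d in deltas])  # (n_vectors, total_dim)
    sims = D @ D.T                                       # pairwise cosine similarities
    off_diag = sims - torch.eye(len(deltas), device=D.device)
    return (off_diag ** 2).sum() / 2                     # push pairwise similarities toward zero

# This term would then be weighted by some coefficient and combined with whatever
# objective the steering vectors are trained on, in place of the hard constraint.
```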
Why do you guys think this is happening? One possibility is that the model is doing some amount of ensembling (thinking back to The Clock and The Pizza, where ensembling showed up in a toy setting). The fact that it holds across all steering vectors is pretty mysterious, but at least for the specific examples in the post, even vector 9's outputs were semi-fantasy.
Also, what are y'all's intuitions on picking layers for this stuff? I understand that in the post you steer at early layers because they might be acting something like switches into different classes of functionality. But implicit in the choice of layer 10 seems to be that you also don't want to go too early, maybe because the very first layers are still doing low-level token processing, learning basic things like whether a word is a noun. Do you choose layers based on experience tinkering in Jupyter notebooks and the like, or have you run some sort of grid over layers to get a sense of the effects elsewhere? If the latter, it would be nice to know, to aid in hypothesis formation and the like.
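In case it clarifies what I mean by a grid: something like sweeping the source layer and looking at how large an effect the perturbation has downstream, e.g. reusing the `downstream_diff` sketch from above (here `model`, `input_ids`, `steering_vec`, and `num_layers` are placeholders, and the norm of the induced diff is just one cheap proxy metric):

```python
# Purely illustrative: how big an effect does a fixed-norm perturbation at each
# source layer have on the final layer's activations?
effect_by_layer = {}
for source_layer in range(num_layers - 1):
    delta = downstream_diff(model, input_ids, steering_vec,
                            source_layer, target_layer=num_layers - 1)
    effect_by_layer[source_layer] = delta.norm().item()
```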