Yes, I train them one at a time, constraining each new vector to be orthogonal to the older ones (this was not clear in the post, so thanks for asking!).
I haven’t experimented with this, but you could also imagine using only “soft” orthogonality constraints (e.g., penalizing pairwise cosine similarities between vectors).
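For concreteness, here's a minimal sketch of one way the hard constraint could be implemented in PyTorch (the names `orthogonalize`, `steering_vec`, and `learned_vectors` are just illustrative, not from the post): after each optimizer step, project the current vector onto the orthogonal complement of the previously trained vectors.

```python
import torch

def orthogonalize(v, previous):
    """Project v onto the orthogonal complement of previously learned vectors.

    Assumes the previous vectors are mutually orthogonal (which holds if each
    was trained under the same constraint), so subtracting the component along
    each one in turn is an exact projection.
    """
    for u in previous:
        u_hat = u / u.norm()
        v = v - (v @ u_hat) * u_hat
    return v

# Hypothetical training loop for the k-th steering vector: after each
# optimizer step, re-project so it stays orthogonal to the k-1 vectors
# trained so far.
# with torch.no_grad():
#     steering_vec.data = orthogonalize(steering_vec.data, learned_vectors)
```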
This is an interesting question!
I just checked this. The cosine similarity of δ9 and δ22 is .52, which is much higher than you'd expect from random vectors of the same dimensionality (this computes the δ's across all tokens and then flattens them, which is how the objective was computed for the main refusal experiment in the post).
If you restrict the calculation of the δ's to just the assistant tag at the end of the prompt, the cosine similarity between δ9 and δ22 goes up to .87.
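In case it's useful, here's roughly how those two numbers can be computed; this is a hedged sketch in PyTorch, and the function name and tensor shapes are my own assumptions about how the δ's are stored.

```python
import torch
import torch.nn.functional as F

def delta_cosine_sim(delta_a, delta_b, position=None):
    """Cosine similarity between two activation-difference tensors.

    delta_a, delta_b: (num_tokens, d_model) tensors of the downstream
    activation differences (the δ's) induced by two steering vectors.
    position=None flattens across all tokens (matching how the refusal
    objective was computed); an integer restricts the comparison to a
    single token, e.g. -1 for the assistant tag at the end of the prompt.
    """
    if position is None:
        a, b = delta_a.flatten(), delta_b.flatten()
    else:
        a, b = delta_a[position], delta_b[position]
    return F.cosine_similarity(a, b, dim=0).item()
```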
Interestingly, the cosine similarities of the δ's seem to be somewhat high across all pairs of steering vectors (mean of .25 across pairs, which is higher than for random vectors, where it would be close to zero). This suggests it might be better to use some sort of soft orthogonality constraint over the δ's (by penalizing pairwise cosine similarities) rather than a hard orthogonality constraint over the steering vectors, if you want better diversity across vectors. I'll have to try this at some point.
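A rough sketch of the kind of soft penalty I have in mind (again in PyTorch; the squared-cosine form and the `weight` hyperparameter are just one plausible choice, not something I've tested):

```python
import torch
import torch.nn.functional as F

def soft_orthogonality_penalty(current_delta, previous_deltas, weight=1.0):
    """Penalty on pairwise cosine similarities between the flattened δ of the
    vector currently being trained and the δ's of previously trained vectors.

    previous_deltas would typically be precomputed and detached, so gradients
    only flow through current_delta.
    """
    current = current_delta.flatten()
    penalty = torch.tensor(0.0, device=current.device)
    for prev in previous_deltas:
        penalty = penalty + F.cosine_similarity(current, prev.flatten(), dim=0) ** 2
    return weight * penalty

# Hypothetical use inside the training loop:
# loss = main_objective(steering_vec) + soft_orthogonality_penalty(delta, previous_deltas)
```

The idea would be to add this term to the main objective for the vector being trained, so diversity is encouraged in δ-space rather than enforced exactly in the space of steering vectors.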