I wonder how much of these orthogonal vectors are “actually orthogonal” once we consider we are adding two vectors together, and that the model has things like LayerNorm.
If one conditions on downstream midlayer activations being “sufficiently different” it seems possible one could find like 10x degeneracy of actual effects these have on models. (A possibly relevant factor is how big the original activation vector is compared to the steering vector?)
I wonder how much of these orthogonal vectors are “actually orthogonal” once we consider we are adding two vectors together, and that the model has things like LayerNorm.
If one conditions on downstream midlayer activations being “sufficiently different” it seems possible one could find like 10x degeneracy of actual effects these have on models. (A possibly relevant factor is how big the original activation vector is compared to the steering vector?)