But there is still a mystery I don’t fully understand: how is it possible to find so many “noise” vectors that barely influence the network’s output?
In unrelated experiments I found that steering in a uniformly random direction is much less effective than steering in a random direction sampled with the same covariance as the real activations. This suggests there may be many directions[1] that barely influence the network’s output. This was on GPT2, but I’d expect it to generalize to other Transformers. A rough sketch of the comparison is below.
Though I don’t know how large that space is or what its dimensionality is; I’m judging this by the “sensitivity curve” (how much steering is needed before the KL divergence changes noticeably).
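For concreteness, here is a minimal sketch (not my exact setup) of how one could compare the two kinds of random directions on GPT2 with Hugging Face transformers. The layer index, prompts, and steering scales are placeholders; the covariance-matched direction is drawn as a random linear combination of centered activations, which has the empirical covariance by construction.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

torch.manual_seed(0)
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
LAYER = 6  # placeholder: which residual-stream block to steer

# --- estimate the activation statistics at LAYER (in practice, use far more tokens) ---
acts = []
def collect(module, inputs, output):
    # GPT2Block returns a tuple; output[0] is the hidden states (batch, seq, d_model)
    acts.append(output[0].detach().reshape(-1, output[0].shape[-1]))

handle = model.transformer.h[LAYER].register_forward_hook(collect)
with torch.no_grad():
    for text in ["The quick brown fox", "Paris is the capital of", "In machine learning,"]:
        model(**tokenizer(text, return_tensors="pt"))
handle.remove()
acts = torch.cat(acts)            # (n_tokens, d_model)
centered = acts - acts.mean(0)

def unit(v):
    return v / v.norm()

# Uniformly random direction vs. a direction with the activations' empirical covariance
# (a random linear combination of centered activations has that covariance by construction).
d_uniform = unit(torch.randn(acts.shape[-1]))
d_cov = unit(centered.T @ torch.randn(len(centered)) / len(centered) ** 0.5)

def kl_after_steering(direction, alpha, prompt="The capital of France is"):
    """KL(steered || unsteered) of the next-token distribution when adding
    alpha * direction to the residual stream at LAYER."""
    ids = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        base = torch.log_softmax(model(**ids).logits[0, -1], dim=-1)

    def steer(module, inputs, output):
        return (output[0] + alpha * direction,) + output[1:]

    handle = model.transformer.h[LAYER].register_forward_hook(steer)
    with torch.no_grad():
        steered = torch.log_softmax(model(**ids).logits[0, -1], dim=-1)
    handle.remove()
    return torch.sum(steered.exp() * (steered - base)).item()

# A crude "sensitivity curve": KL divergence as a function of steering scale.
for alpha in [1.0, 5.0, 10.0, 20.0]:
    print(f"alpha={alpha:5.1f}  uniform: {kl_after_steering(d_uniform, alpha):.4f}"
          f"  cov-matched: {kl_after_steering(d_cov, alpha):.4f}")
```

If the observation above holds, the covariance-matched direction should move the KL divergence substantially more than the uniform one at the same norm.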