But there is still a mystery I don’t fully understand: how is it possible to find so many “noise” vectors that barely influence the network’s output?
In unrelated experiments I found that steering in a uniformly random direction is much less effective than steering in a random direction sampled with the same covariance as the real activations. This suggests there may be many directions[1] that barely influence the network’s output. This was on GPT2, but I’d expect it to generalize to other Transformers. A rough sketch of the comparison is below.
Though I don’t know how large that space is or what its dimensionality is; I’m judging this by the “sensitivity curve” (how much steering is needed before the KL divergence changes noticeably).
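For concreteness, here is a minimal sketch (not my exact setup) of how one could compare the two kinds of random directions on GPT2 with Hugging Face transformers. The layer index, prompts, and steering scales are placeholders; the covariance-matched direction is drawn as a random linear combination of centered activations, which has the empirical covariance by construction.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

torch.manual_seed(0)
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
LAYER = 6  # placeholder: which residual-stream block to steer

# --- estimate the activation statistics at LAYER (in practice, use far more tokens) ---
acts = []
def collect(module, inputs, output):
    # GPT2Block returns a tuple; output[0] is the hidden states (batch, seq, d_model)
    acts.append(output[0].detach().reshape(-1, output[0].shape[-1]))

handle = model.transformer.h[LAYER].register_forward_hook(collect)
with torch.no_grad():
    for text in ["The quick brown fox", "Paris is the capital of", "In machine learning,"]:
        model(**tokenizer(text, return_tensors="pt"))
handle.remove()
acts = torch.cat(acts)            # (n_tokens, d_model)
centered = acts - acts.mean(0)

def unit(v):
    return v / v.norm()

# Uniformly random direction vs. a direction with the activations' empirical covariance
# (a random linear combination of centered activations has that covariance by construction).
d_uniform = unit(torch.randn(acts.shape[-1]))
d_cov = unit(centered.T @ torch.randn(len(centered)) / len(centered) ** 0.5)

def kl_after_steering(direction, alpha, prompt="The capital of France is"):
    """KL(steered || unsteered) of the next-token distribution when adding
    alpha * direction to the residual stream at LAYER."""
    ids = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        base = torch.log_softmax(model(**ids).logits[0, -1], dim=-1)

    def steer(module, inputs, output):
        return (output[0] + alpha * direction,) + output[1:]

    handle = model.transformer.h[LAYER].register_forward_hook(steer)
    with torch.no_grad():
        steered = torch.log_softmax(model(**ids).logits[0, -1], dim=-1)
    handle.remove()
    return torch.sum(steered.exp() * (steered - base)).item()

# A crude "sensitivity curve": KL divergence as a function of steering scale.
for alpha in [1.0, 5.0, 10.0, 20.0]:
    print(f"alpha={alpha:5.1f}  uniform: {kl_after_steering(d_uniform, alpha):.4f}"
          f"  cov-matched: {kl_after_steering(d_cov, alpha):.4f}")
```

If the observation above holds, the covariance-matched direction should move the KL divergence substantially more than the uniform one at the same norm.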