I haven’t tried the first singular vector of the Jacobian between layers. But for p=2, q=1 I tried looking at the first few eigenvectors of the Hessian of the objective function (around θ=0) on the bomb-making prompt for Qwen-1.8B. These didn’t appear to do anything interesting at any norm I scaled them to, so my feeling is that full-blown gradient descent is needed.
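(For anyone wanting to reproduce this, here’s a minimal sketch of that eigenvector computation. It assumes a hypothetical `steering_objective(theta)` that maps a steering vector at the source layer to the scalar p=2, q=1 objective; the Hessian is never materialized, only Hessian-vector products via double backprop.)

```python
import torch

def hvp(f, theta, v):
    # Hessian-vector product H @ v via double backprop;
    # never materializes the full Hessian.
    theta = theta.detach().requires_grad_(True)
    (grad,) = torch.autograd.grad(f(theta), theta, create_graph=True)
    (Hv,) = torch.autograd.grad(grad @ v, theta)
    return Hv

def top_hessian_eigvec(f, dim, iters=100):
    # Largest-|eigenvalue| eigenvector of the Hessian at theta = 0,
    # by power iteration. For the "first few", deflate against the
    # eigenvectors found so far, or run torch.lobpcg on the HVP operator.
    theta0 = torch.zeros(dim)
    v = torch.randn(dim)
    v = v / v.norm()
    for _ in range(iters):
        Hv = hvp(f, theta0, v)
        v = Hv / Hv.norm()
    eigval = v @ hvp(f, theta0, v)
    return eigval, v

# e.g. eigval, direction = top_hessian_eigvec(steering_objective, dim=2048)
```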
The singular vectors of the Jacobian between two layers seem more similar to what you’re doing in the OP than the Hessian of the objective function does? Because the Hessian of the objective function sort of forces it all to be mediated by the final probabilities, which means it discounts directions in activation space that don’t change the probabilities yet, but would change the probabilities if the change in activations were scaled up beyond infinitesimal.
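(For concreteness, one way to get that first singular vector without ever building the Jacobian, assuming a hypothetical `f` that runs the model from the source layer’s activations to the target layer’s, e.g. wired up with hooks; this is a sketch, not the OP’s code.)

```python
import torch
from torch.autograd.functional import jvp, vjp

def top_jacobian_singular_vector(f, a0, iters=50):
    # Leading right singular vector of J = df/da at a0, by power
    # iteration on J^T J. Jv comes from a forward-mode jvp, then
    # J^T (Jv) from a reverse-mode vjp; J itself is never built.
    v = torch.randn_like(a0)
    v = v / v.norm()
    for _ in range(iters):
        _, Jv = jvp(f, a0, v)
        _, JtJv = vjp(f, a0, Jv)
        v = JtJv / JtJv.norm()
    _, Jv = jvp(f, a0, v)
    return Jv.norm(), v  # (top singular value, right singular vector)
```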
Edit: wait, maybe I misunderstood. I assumed that by the objective function you meant some cross-entropy on the token predictions, but in context it’s more likely you meant the objective for the magnitude of the change in later-layer activations induced by a given activation vector?
Yes, I meant the unsupervised steering objective (magnitude of downstream changes) as opposed to cross-entropy.
Fair, its eigenvectors should be equivalent to the (right) singular vectors of the Jacobian: near θ=0 the downstream change is ≈ Jθ, so a squared-magnitude objective has Hessian ∝ JᵀJ.
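(A quick toy check of that claim, with a random matrix standing in for the Jacobian:)

```python
import torch

torch.manual_seed(0)
J = torch.randn(5, 3)                    # stand-in for the layer-to-layer Jacobian
H = 2 * J.T @ J                          # Hessian of theta -> ||J @ theta||^2
eigvals, eigvecs = torch.linalg.eigh(H)  # eigenvalues in ascending order
_, S, Vh = torch.linalg.svd(J)           # singular values in descending order

# Top Hessian eigenvector == top right singular vector (up to sign),
# and its eigenvalue is 2 * sigma_max^2.
assert torch.allclose(eigvecs[:, -1].abs(), Vh[0].abs(), atol=1e-5)
assert torch.allclose(eigvals[-1], 2 * S[0] ** 2, atol=1e-4)
```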