The singular vectors of the Jacobian between two layers seems more similar to what you’re doing in the OP than the Hessian of the objective function does? Because the Hessian of the objective function sort of forces it all to be mediated by the final probabilities, which means it discounts directions in activation space that don’t change the probabilities yet, but would change the probabilities if the change in activations was scaled up beyond infinitesimal.
Edit: wait, maybe I misunderstood, I assumed by the objective function you meant some cross-entropy on the token predictions, but I guess in-context it’s more likely you meant the objective function for the magnitude of change in later layer activations induced by a given activation vector?
The singular vectors of the Jacobian between two layers seems more similar to what you’re doing in the OP than the Hessian of the objective function does? Because the Hessian of the objective function sort of forces it all to be mediated by the final probabilities, which means it discounts directions in activation space that don’t change the probabilities yet, but would change the probabilities if the change in activations was scaled up beyond infinitesimal.
Edit: wait, maybe I misunderstood, I assumed by the objective function you meant some cross-entropy on the token predictions, but I guess in-context it’s more likely you meant the objective function for the magnitude of change in later layer activations induced by a given activation vector?
Yes, I meant the unsupervised steering objective (magnitude of downstream changes) as opposed to cross-entropy.
Fair, it’s eigenvectors should be equivalent to the singular vectors of the Jacobian.