tailcalled comments on Mechanistically Eliciting Latent Behaviors in Language Models

tailcalled 1 May 2024 16:19 UTC
LW: 2 AF: 1
The singular vectors of the Jacobian between two layers seem more similar to what you’re doing in the OP than the Hessian of the objective function does. The Hessian of the objective function forces everything to be mediated by the final probabilities, so it discounts directions in activation space that don’t change the probabilities at infinitesimal scale but would change them if the change in activations were scaled up beyond infinitesimal.

Edit: wait, maybe I misunderstood. I assumed that by the objective function you meant some cross-entropy on the token predictions, but in context it’s more likely you meant the objective for the magnitude of change in later-layer activations induced by a given activation vector?
- Andrew Mack 3 May 2024 4:30 UTC
  LW: 1 AF: 1
  Yes, I meant the unsupervised steering objective (magnitude of downstream changes) as opposed to cross-entropy.
  - tailcalled 3 May 2024 6:23 UTC
    3 points
    Fair, the eigenvectors of its Hessian should be equivalent to the right singular vectors of the Jacobian.
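
The equivalence claimed above can be checked numerically on a toy example. This is only a sketch under stated assumptions, not the OP's actual setup: it assumes the steering objective is the squared L2 norm of the downstream change, g(theta) = ||f(x + theta) - f(x)||^2, evaluated at theta = 0, and it uses a hypothetical one-layer map f(v) = tanh(W v) in place of real model layers. Under those assumptions the residual f(x + theta) - f(x) vanishes at theta = 0, so the Hessian of g there is exactly 2 J^T J, where J is the Jacobian of f at x. Its eigenvectors are the right singular vectors of J, and its eigenvalues are twice the squared singular values.

```python
import numpy as np

# Hypothetical toy stand-in for "the map between two layers".
rng = np.random.default_rng(0)
d_in, d_out = 4, 6
W = rng.normal(size=(d_out, d_in))
x = rng.normal(size=d_in)


def f(v):
    return np.tanh(W @ v)


def objective(theta):
    # Assumed objective: squared L2 norm of the downstream change.
    return np.sum((f(x + theta) - f(x)) ** 2)


# Analytic Jacobian of f at x: diag(1 - tanh(Wx)^2) @ W.
J = (1.0 - np.tanh(W @ x) ** 2)[:, None] * W


def numerical_hessian_at_zero(g, dim, h=1e-4):
    """Central-difference Hessian of g at theta = 0."""
    basis = np.eye(dim)
    hess = np.zeros((dim, dim))
    for i in range(dim):
        for j in range(dim):
            ei, ej = h * basis[i], h * basis[j]
            hess[i, j] = (
                g(ei + ej) - g(ei - ej) - g(-ei + ej) + g(-ei - ej)
            ) / (4.0 * h * h)
    return 0.5 * (hess + hess.T)  # symmetrize away rounding noise


H = numerical_hessian_at_zero(objective, d_in)

# eigh returns ascending eigenvalues; reverse to match SVD's descending order.
eigvals, eigvecs = np.linalg.eigh(H)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
_, S, Vt = np.linalg.svd(J, full_matrices=False)

# Absolute overlaps, since eigenvectors are only defined up to sign.
overlap = np.abs(eigvecs.T @ Vt.T)

print("max |H - 2 J^T J|:", np.max(np.abs(H - 2.0 * J.T @ J)))
print("eigenvalues of H:  ", eigvals)
print("2 * sigma^2 of J:  ", 2.0 * S**2)
print("overlap close to identity:", np.allclose(overlap, np.eye(d_in), atol=1e-2))
```

The match holds only at second order around theta = 0. Once the perturbation is scaled beyond the infinitesimal regime, or the objective uses a different norm or exponent, the Hessian picks up extra terms and the two sets of directions can diverge.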