I haven’t tried the first singular vector of the Jacobian between layers. But for p=2, q=1 I tried looking at the first few eigenvectors of the Hessian of the objective function (around θ=0) on the bomb-making prompt for Qwen-1.8B. These didn’t appear to do anything interesting at any norm I scaled them to, so my feeling is that full-blown gradient descent is needed.
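(For anyone wanting to reproduce this, here’s a minimal sketch of that eigenvector computation. It assumes a hypothetical `steering_objective(theta)` that maps a steering vector at the source layer to the scalar p=2, q=1 objective; the Hessian is never materialized, only Hessian-vector products via double backprop.)

```python
import torch

def hvp(f, theta, v):
    # Hessian-vector product H @ v via double backprop;
    # never materializes the full Hessian.
    theta = theta.detach().requires_grad_(True)
    (grad,) = torch.autograd.grad(f(theta), theta, create_graph=True)
    (Hv,) = torch.autograd.grad(grad @ v, theta)
    return Hv

def top_hessian_eigvec(f, dim, iters=100):
    # Largest-|eigenvalue| eigenvector of the Hessian at theta = 0,
    # by power iteration. For the "first few", deflate against the
    # eigenvectors found so far, or run torch.lobpcg on the HVP operator.
    theta0 = torch.zeros(dim)
    v = torch.randn(dim)
    v = v / v.norm()
    for _ in range(iters):
        Hv = hvp(f, theta0, v)
        v = Hv / Hv.norm()
    eigval = v @ hvp(f, theta0, v)
    return eigval, v

# e.g. eigval, direction = top_hessian_eigvec(steering_objective, dim=2048)
```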
The singular vectors of the Jacobian between two layers seem more similar to what you’re doing in the OP than the Hessian of the objective function does? Because the Hessian of the objective function sort of forces it all to be mediated by the final probabilities, which means it discounts directions in activation space that don’t change the probabilities yet, but would change the probabilities if the change in activations were scaled up beyond infinitesimal.
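(For concreteness, one way to get that first singular vector without ever building the Jacobian, assuming a hypothetical `f` that runs the model from the source layer’s activations to the target layer’s, e.g. wired up with hooks; this is a sketch, not the OP’s code.)

```python
import torch
from torch.autograd.functional import jvp, vjp

def top_jacobian_singular_vector(f, a0, iters=50):
    # Leading right singular vector of J = df/da at a0, by power
    # iteration on J^T J. Jv comes from a forward-mode jvp, then
    # J^T (Jv) from a reverse-mode vjp; J itself is never built.
    v = torch.randn_like(a0)
    v = v / v.norm()
    for _ in range(iters):
        _, Jv = jvp(f, a0, v)
        _, JtJv = vjp(f, a0, Jv)
        v = JtJv / JtJv.norm()
    _, Jv = jvp(f, a0, v)
    return Jv.norm(), v  # (top singular value, right singular vector)
```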
Edit: wait, maybe I misunderstood. I assumed that by the objective function you meant some cross-entropy on the token predictions, but in context it’s more likely you meant the objective for the magnitude of the change in later-layer activations induced by a given activation vector?
Yes, I meant the unsupervised steering objective (magnitude of downstream changes) as opposed to cross-entropy.
Fair, its eigenvectors should be equivalent to the (right) singular vectors of the Jacobian: near θ=0 the downstream change is ≈ Jθ, so a squared-magnitude objective has Hessian ∝ JᵀJ.
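(A quick toy check of that claim, with a random matrix standing in for the Jacobian:)

```python
import torch

torch.manual_seed(0)
J = torch.randn(5, 3)                    # stand-in for the layer-to-layer Jacobian
H = 2 * J.T @ J                          # Hessian of theta -> ||J @ theta||^2
eigvals, eigvecs = torch.linalg.eigh(H)  # eigenvalues in ascending order
_, S, Vh = torch.linalg.svd(J)           # singular values in descending order

# Top Hessian eigenvector == top right singular vector (up to sign),
# and its eigenvalue is 2 * sigma_max^2.
assert torch.allclose(eigvecs[:, -1].abs(), Vh[0].abs(), atol=1e-5)
assert torch.allclose(eigvals[-1], 2 * S[0] ** 2, atol=1e-4)
```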