How important is it to use full-blown gradient descent to train them? Could one instead take the first singular vector for the Jacobian between the neural network layers, and get something that works similarly well?
I haven’t tried the first singular vector of the Jacobian between layers. But for p=2,q=1 I tried looking at the first few eigenvectors of the Hessian of the objective function (around θ=0) on the bomb-making prompt for Qwen-1.8B. These didn’t appear to do anything interesting regardless of norm. So my feeling is that full-blown gradient descent is needed.
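(Rough sketch of the kind of computation this involves, not the exact code; `steering_objective` is a hypothetical stand-in for the scalar p=2,q=1 objective evaluated on the fixed prompt, as a function of the steering vector θ:)

```python
import torch

def top_hessian_eigenvector(steering_objective, dim, iters=200):
    """Power iteration on the Hessian of the steering objective at theta = 0,
    using Hessian-vector products so the Hessian is never materialized.
    Returns the eigenpair with the largest-magnitude eigenvalue; for the
    "first few" eigenvectors you would deflate against those already found
    (or use a Lanczos-style solver instead)."""
    theta = torch.zeros(dim, requires_grad=True)
    loss = steering_objective(theta)
    # Keep the graph so the gradient can be differentiated again (HVP).
    grad = torch.autograd.grad(loss, theta, create_graph=True)[0]

    v = torch.randn(dim)
    v = v / v.norm()
    for _ in range(iters):
        # Hessian-vector product: H v = d/dtheta (grad . v)
        hv = torch.autograd.grad(grad, theta, grad_outputs=v, retain_graph=True)[0]
        v = hv / hv.norm()
    hv = torch.autograd.grad(grad, theta, grad_outputs=v, retain_graph=True)[0]
    eigval = torch.dot(v, hv)  # Rayleigh quotient (v is unit norm)
    return eigval.item(), v.detach()
```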
The singular vectors of the Jacobian between two layers seem more similar to what you’re doing in the OP than the eigenvectors of the Hessian of the objective function do? Because the Hessian of the objective function sort of forces everything to be mediated by the final probabilities, which means it discounts directions in activation space that don’t change the probabilities yet, but would change them if the change in activations were scaled up beyond infinitesimal.
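Concretely, something like the following is the kind of thing I have in mind (a rough sketch; `layer_map` is a hypothetical stand-in for running the frozen model on the fixed prompt from the source layer’s activations to some later layer’s activations):

```python
import torch
from torch.func import jvp, vjp

def top_jacobian_singular_vector(layer_map, source_acts, iters=100):
    """Power iteration for the top right singular vector of the Jacobian of
    `layer_map` at `source_acts`: the direction in source-layer activation
    space that the downstream layers amplify the most, to first order.
    Alternates J v (forward mode) and J^T u (reverse mode), so the full
    Jacobian is never materialized."""
    _, pullback = vjp(layer_map, source_acts)  # reusable J^T(.) at source_acts
    v = torch.randn_like(source_acts)
    v = v / v.norm()
    for _ in range(iters):
        _, u = jvp(layer_map, (source_acts,), (v,))  # u = J v
        (w,) = pullback(u)                           # w = J^T u
        v = w / w.norm()
    sigma = jvp(layer_map, (source_acts,), (v,))[1].norm()  # top singular value
    return sigma.item(), v
```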
Edit: wait, maybe I misunderstood. I assumed that by the objective function you meant some cross-entropy on the token predictions, but in context it’s more likely you meant the objective for the magnitude of the change in later-layer activations induced by a given activation vector?
Yes, I meant the unsupervised steering objective (magnitude of downstream changes) as opposed to cross-entropy.
Fair, its eigenvectors should then be equivalent to the singular vectors of the Jacobian.
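(Spelling the equivalence out for the simplest case, and assuming the purely quadratic version of the objective: if the linearization around θ=0 gives f(θ) = ||Jθ||², with J the Jacobian from the steered activations to the downstream activations, then the Hessian is ∇²f = 2JᵀJ, and the eigenvectors of JᵀJ are exactly the right singular vectors of J, with Hessian eigenvalues 2σᵢ². With other exponents, or other ways of aggregating over positions, the linearized objective isn’t purely quadratic, so I’d only expect this to hold approximately.)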