Nice work! A few questions:
I’m curious if you have found any multiplicity in the output directions (what you denote as $\vec{u}^*_\ell$), or if the multiplicity is only in the input directions. I would predict that there would be some multiplicity in output directions, but much less than the multiplicity in input directions for the corresponding concept.
Relatedly, how do you think about output directions in general? Do you think they are just upweighting/downweighting tokens? I’d imagine that their level of abstraction depends on how far the output layer is from the end of the network, which will ultimately determine how much of their effect is exerted directly on the unembed vs. indirectly through other layers.
Regarding your first question (multiplicity of the $u_\ell$’s, as compared with the $v_\ell$’s): I would say that a priori my intuition matched yours (that there should be less multiplicity in output directions), but that the empirical evidence is mixed:
Evidence for less output vs. input multiplicity: In initial experiments, I found that orthogonalizing $\hat{U}$ led to less stable optimization curves, and to subjectively less interpretable features; if there were many genuinely distinct output directions to find, forcing them apart like this shouldn’t have hurt. This suggests that there is less multiplicity in output directions. (And in fact my suggestion above in algorithms 2/3 is not to orthogonalize $\hat{U}$; the orthogonalization step I mean is sketched below.)
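For concreteness, here is a minimal sketch of that orthogonalization step (PyTorch; the helper name and the columns-as-directions layout are my own assumptions, not anything from the post):

```python
import torch

def orthogonalize_columns(U_hat: torch.Tensor) -> torch.Tensor:
    """Replace the columns of U_hat with an orthonormal basis.

    U_hat: (d_model, n_directions) matrix whose columns are the learned
    output directions. A reduced QR decomposition gives a Q that spans
    the same subspace as U_hat but with mutually orthogonal, unit-norm
    columns.
    """
    Q, _ = torch.linalg.qr(U_hat)  # Q: (d_model, n_directions)
    return Q
```

In a training loop one would presumably apply this after each gradient step, projected-gradient style; the polar factor from an SVD would be an alternative that perturbs each column less.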
Evidence for more (or at least the same) output vs. input multiplicity: Taking the $\hat{U}$ from the same DCT for which I analyzed $\hat{V}$ multiplicity, and applying the same metrics to the top 240 vectors, I get that the average of $|\langle \hat{u}_\ell, \hat{u}_{\ell'} \rangle|$ is 0.25, while the value for the $\hat{v}_\ell$’s was 0.36, so that on average the output directions are less similar to each other than the input directions (with the caveat that ideally I’d do the comparison over multiple runs and compute some sort of p-value). Similarly, the condition number of $\hat{U}$ for that run is 27, less than the condition number of $\hat{V}$ of 38, so that $\hat{U}$ looks “less collinear” than $\hat{V}$. (Both metrics are sketched in code below.)
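A minimal sketch of both metrics (again with a hypothetical helper name, assuming the directions are stored as the columns of a matrix):

```python
import torch

def multiplicity_metrics(M: torch.Tensor) -> tuple[float, float]:
    """M: (d_model, k) matrix whose k columns are direction vectors.

    Returns:
      mean_abs_sim: average |<m_i, m_j>| over distinct pairs of
                    unit-normalized columns (higher = more similar)
      cond:         condition number of M, the ratio of its largest
                    to smallest singular value (higher = more collinear)
    """
    Mn = M / M.norm(dim=0, keepdim=True)   # unit-normalize each column
    G = (Mn.T @ Mn).abs()                  # Gram matrix of |cosine similarities|
    k = M.shape[1]
    mean_abs_sim = (G.sum() - G.diagonal().sum()) / (k * (k - 1))
    cond = torch.linalg.cond(M)            # 2-norm condition number via SVD
    return mean_abs_sim.item(), cond.item()
```

Applied to $\hat{U}$ and $\hat{V}$ restricted to their top 240 columns, this should reproduce the numbers above (0.25 vs. 0.36, and condition numbers 27 vs. 38).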
As for how to think about output directions, my guess is that at layer $t=20$ in a 30-layer model, these features are not just upweighting/downweighting tokens but are doing something more abstract. I don’t have any hard empirical evidence for this, though.