Regarding your first question (the multiplicity of the uℓ's, as compared with the vℓ's): I would say that a priori my intuition matched yours (that there should be less multiplicity in output directions), but the empirical evidence is mixed:
Evidence for less output vs input multiplicity: In initial experiments, I found that orthogonalizing ^U led to less stable optimization curves and to subjectively less interpretable features. This suggests that there is less multiplicity in output directions: presumably, if each feature admitted many interchangeable output directions, forcing the ^uℓ's to be orthogonal would come at little cost. (And in fact, my suggestion above in algorithms 2/3 is not to orthogonalize ^U.)
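For concreteness, by "orthogonalizing ^U" I mean something like the minimal sketch below (assuming ^U is stored as a d_model × k matrix whose columns are the ^uℓ's; QR is just one reasonable choice of projection here, and the helper name is purely illustrative):

```python
import torch

def orthogonalize_columns(U: torch.Tensor) -> torch.Tensor:
    """Replace the columns of U with an orthonormal set spanning the same
    subspace (assuming full column rank), via a reduced QR decomposition."""
    # For a (d_model, k) input with d_model >= k, "reduced" mode returns
    # a Q with orthonormal columns of the same shape as U.
    Q, _ = torch.linalg.qr(U, mode="reduced")
    return Q

# e.g., applied as a projection after each optimizer step:
# with torch.no_grad():
#     U.copy_(orthogonalize_columns(U))
```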
Evidence for more (or at least the same) output vs input multiplicity: Taking the ^U from the same DCT for which I analyzed the ^V multiplicity, and applying the same metrics to the top 240 vectors, I get that the average of |⟨^uℓ,^uℓ′⟩| is .25, while the value for the ^vℓ's was .36, so on average the output directions are less similar to each other than the input directions (with the caveat that ideally I'd do the comparison over multiple runs and compute some sort of p-value). Similarly, the condition number of ^U for that run is 27, less than ^V's condition number of 38, so ^U looks "less collinear" than ^V.
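For reference, here is a minimal sketch of the two metrics above (same storage assumption as before, with ^U a d_model × k matrix of column vectors; the function name is again just for illustration):

```python
import torch

def multiplicity_metrics(U: torch.Tensor) -> tuple[float, float]:
    """Mean absolute pairwise cosine similarity |<u_l, u_l'>| over distinct
    pairs of columns, plus the condition number of the matrix."""
    # Normalize columns so that inner products are cosine similarities.
    U_unit = U / U.norm(dim=0, keepdim=True)
    sims = (U_unit.T @ U_unit).abs()  # (k, k) matrix of |<u_l, u_l'>|
    k = sims.shape[0]
    # Drop the trivial l = l' diagonal entries before averaging.
    off_diag = sims[~torch.eye(k, dtype=torch.bool, device=sims.device)]
    return off_diag.mean().item(), torch.linalg.cond(U).item()

# e.g., on the top 240 output directions:
# avg_sim, cond = multiplicity_metrics(U[:, :240])
```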
As for how to think about output directions, my guess is that at layer t=20 in a 30-layer model, these features are not just upweighting/downweighting tokens but are doing something more abstract. I don't have any hard empirical evidence for this, though.