After looking more into the outputs, I think the KL-divergence plots are slightly misleading. In the code and jailbreak cases, they do seem to show when the vectors stop being meaningful. But in the alien and STEM problem cases, they don’t (there also seem to be ~800 alien and STEM problem vectors). The magnitude plots seem much more helpful there. I’m still confused about why the KL-divergence plots aren’t as meaningful in those cases, but maybe it has to do with the distribution of language that the vectors steer the model into? Coding is clearly a very different distribution of language from English, but jailbreak text is not that different a distribution from English. So I’m still confused here. But the KL-divergences are also computed only on the logits at the last token position, so maybe it’s just a small sample size.
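To illustrate the small-sample concern, here is a minimal sketch of the difference between a last-token-only KL estimate and one averaged over every token position. All names, shapes, and the random logits are hypothetical, just to show the two estimators side by side:

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the vocab axis
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kl_divergence(p_logits, q_logits, axis=-1):
    # KL(P || Q) per token position, computed from raw logits
    p = softmax(p_logits, axis=axis)
    log_p = np.log(p + 1e-12)
    log_q = np.log(softmax(q_logits, axis=axis) + 1e-12)
    return (p * (log_p - log_q)).sum(axis=axis)

# Hypothetical logits with shape [num_positions, vocab_size]
rng = np.random.default_rng(0)
base_logits = rng.normal(size=(16, 50))                               # unsteered
steered_logits = base_logits + rng.normal(scale=0.5, size=(16, 50))   # steered

per_token_kl = kl_divergence(steered_logits, base_logits)

last_token_kl = per_token_kl[-1]     # what a last-token-only plot would use
mean_kl = per_token_kl.mean()        # averaged over all positions, lower variance
```

Averaging over positions (or over many prompts) would smooth out the noise of a single-token estimate, which might make the alien and STEM-problem curves more interpretable.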