William_S comments on I found >800 orthogonal “write code” steering vectors

William_S 15 Jul 2024 19:25 UTC
11 points
−6
Hypothesis: each of these vectors represents a single token that is usually associated with code. The vector says “I should output this token soon”, and the model then plans around that to produce code. But adding vectors that represent code tokens doesn’t necessarily produce another vector that represents a code token, which would explain why you don’t see compositionality. It does seem somewhat plausible that there are ~800 “code tokens” in the representation space.
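
A toy sketch of the non-compositionality point, not a test on the actual model: assume each "code token" is one of 800 orthonormal directions (here made up from random Gaussian vectors via QR; the function name, dimension, and seed are all illustrative assumptions). The normalized sum of two such directions has cosine similarity of about 0.71 with each of its two parts and about 0 with every other direction, so it does not coincide with any single "token" direction.

```python
import numpy as np


def token_sum_cosines(n_tokens=800, dim=1024, seed=0):
    """Cosine similarity between (token 0 + token 1) and every token direction.

    The "token" directions are hypothetical stand-ins: random Gaussian
    vectors orthonormalized with QR, not vectors taken from a real model.
    """
    rng = np.random.default_rng(seed)
    # Reduced QR gives a (dim, n_tokens) matrix with orthonormal columns.
    q, _ = np.linalg.qr(rng.standard_normal((dim, n_tokens)))
    tokens = q.T  # one unit-norm row per toy "token" direction

    # Add two token directions and renormalize the result.
    combined = tokens[0] + tokens[1]
    combined = combined / np.linalg.norm(combined)

    # Rows are unit vectors, so the dot product is the cosine similarity.
    return tokens @ combined


cosines = token_sum_cosines()
print("similarity to the two summed tokens:", cosines[:2])
print("largest |similarity| to any other token:", np.abs(cosines[2:]).max())
```

If the real steering vectors behaved like this, a sum of two would sit between token directions rather than on one, which is one way the hypothesis could be checked against the model's unembedding.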