Yeah, intervening on the entire activation history makes sense. It’s just honestly surprising to me that taking the largest singular vectors even mostly preserves semantic meaning. To my intuition, the thing that preserving large singular vectors preserves is this linear-algebra property about the matrix as a transformation, which feels different from an information-theoretic property about the matrix as an array of numbers.
I agree that it is intriguing. While I’m still testing the method on more established datasets, my intuition for why it works is as follows:
Singular vectors correspond to the principal directions of data variance. In the context of LLMs, these directions capture significant patterns of moral and ethical reasoning embedded within the model’s activations.
While it may seem that the largest singular vectors preserve only linear-algebra properties, they also encapsulate the high-dimensional structure of the data, which I’d argue includes semantic and syntactic information. At the end of the day, these vectors represent the directions along which the model’s activations vary the most, so they inherently encode the most salient semantic information.
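To make the “directions of largest variance” point concrete, here’s a minimal sketch in NumPy. A random matrix stands in for real residual-stream activations, and the shapes and rank `k = 16` are arbitrary choices for illustration, not values from my actual setup. It keeps only the top-k singular directions and checks how much of the total variance (squared Frobenius norm) they retain:

```python
import numpy as np

# Placeholder for a (tokens x hidden_dim) matrix of model activations.
rng = np.random.default_rng(0)
acts = rng.standard_normal((512, 4096))

# Full SVD: acts = U @ diag(S) @ Vt, singular values sorted descending.
U, S, Vt = np.linalg.svd(acts, full_matrices=False)

# Keep only the top-k singular directions -- the axes along which the
# activations vary the most -- and reconstruct a low-rank approximation.
k = 16
acts_lowrank = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

# Fraction of total variance retained by the rank-k approximation.
retained = (S[:k] ** 2).sum() / (S ** 2).sum()
print(f"rank-{k} reconstruction keeps {retained:.1%} of activation energy")
```

On real activations (unlike this isotropic random matrix) the spectrum decays much faster, which is exactly why a small number of singular directions can carry most of the semantically meaningful variation.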