Alejandro Tlaie comments on Introducing SARA: a new activation steering technique

Alejandro Tlaie 16 Jun 2024 18:45 UTC
1 point
0
Hi Charlie, thanks a lot for taking the time to read the post and for the question!

Regarding the idea behind changing the activation histories: I wanted to capture token dependencies, as I thought that concepts that can’t be captured by a single token (as in the case of ActAdd) would be better described by these history-dependent activations. As for why I bring the three relevant activation histories to the same size: that makes them comparable (and, ultimately, lets me compute a similarity between them).

Regarding why SVD: I decided to use SVD because it’s one of the simplest and most ubiquitous matrix factorisation techniques out there (so I didn’t need to validate or benchmark it). It is also computationally cheap, which is crucial because SARA is meant to be applied at inference time.
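
To make the "same size, then similarity" step concrete, here is a minimal NumPy sketch. It is an illustration under assumptions, not necessarily the exact SARA procedure: the function names `reduce_history` and `gram_similarity`, the choice of `k`, and the random stand-in activations are all hypothetical. Because the sign of each singular vector is arbitrary, the sketch compares Gram matrices, which is one sign-invariant choice of similarity.

```python
import numpy as np


def reduce_history(acts: np.ndarray, k: int) -> np.ndarray:
    """Compress a (n_tokens, d_model) activation history to shape (k, d_model).

    Keeps the top-k right singular vectors, each scaled by its singular value.
    (Hypothetical helper; the reduction used in SARA may differ.)
    """
    # Thin SVD: acts = U @ diag(S) @ Vt, singular values in descending order.
    _, S, Vt = np.linalg.svd(acts, full_matrices=False)
    if k > S.shape[0]:
        raise ValueError("k cannot exceed min(n_tokens, d_model)")
    return S[:k, None] * Vt[:k]


def gram_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Normalised Frobenius inner product of the Gram matrices a.T @ a and b.T @ b.

    Uses <a.T a, b.T b>_F = ||a @ b.T||_F^2, so no (d_model, d_model) matrix is
    formed. The value lies in [0, 1] and does not change if any row flips sign.
    (Assumed similarity measure, chosen here for sign invariance.)
    """
    num = np.linalg.norm(a @ b.T) ** 2
    den = np.linalg.norm(a @ a.T) * np.linalg.norm(b @ b.T)
    return float(num / den)


# Random stand-ins for three activation histories with different token counts.
rng = np.random.default_rng(0)
d_model, k = 16, 3
histories = {
    "prompt": rng.normal(size=(5, d_model)),
    "reference_a": rng.normal(size=(8, d_model)),
    "reference_b": rng.normal(size=(6, d_model)),
}

# After reduction, all three share the shape (k, d_model) and can be compared.
reduced = {name: reduce_history(h, k) for name, h in histories.items()}
sim_a = gram_similarity(reduced["prompt"], reduced["reference_a"])
sim_b = gram_similarity(reduced["prompt"], reduced["reference_b"])
```

The thin SVD of an `(n_tokens, d_model)` matrix costs on the order of `n_tokens * d_model * min(n_tokens, d_model)` operations, which stays modest for short prompts.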
- Charlie Steiner 20 Jun 2024 16:40 UTC
  2 points
  0
  Yeah, intervening on the entire activation history makes sense. It’s just honestly surprising to me that taking the largest singular vectors even mostly preserves semantic meaning. To my intuition, the thing that preserving large singular vectors preserves is this linear-algebra property about the matrix as a transformation, which feels different from an information-theoretic property about the matrix as an array of numbers.
  - Alejandro Tlaie 25 Jun 2024 12:16 UTC
    1 point
    0
    I agree that it is intriguing. Although I’m still testing the method on more established datasets, my intuition for why it works is as follows:
    
    Singular vectors correspond to the principal directions of variance in the data (strictly speaking, when the data are mean-centred). In the context of LLMs, I expect these directions to capture significant patterns of moral and ethical reasoning embedded within the model’s activations.
    
    While it may seem that the singular vectors with the largest singular values only preserve linear-algebra properties, they also encapsulate the high-dimensional structure of the data, which I’d argue includes semantic and syntactic information. At the end of the day, these vectors represent the directions along which the model activations vary the most, so I’d expect them to encode important semantic information.
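    
    For reference, the standard linear-algebra facts behind this intuition can be written down as follows, assuming a mean-centred activation matrix $X$ with $n$ rows (tokens) and $d$ columns (the symbols are my notation, not the post's). They show that truncating the SVD keeps the directions of largest variance and gives the best rank-$k$ approximation of the array in Frobenius norm; whether those directions carry semantic content remains the empirical part of the argument.
    
    ```latex
    % Thin SVD of the mean-centred activation matrix X (n x d)
    X = U \Sigma V^{\top}, \qquad \sigma_1 \ge \sigma_2 \ge \dots \ge 0
    
    % Sample covariance: the right singular vectors v_i are the principal
    % directions, and \sigma_i^2 / (n-1) is the variance along v_i
    C = \frac{1}{n-1} X^{\top} X = V \, \frac{\Sigma^{2}}{n-1} \, V^{\top}
    
    % Eckart-Young: the rank-k truncation is the best rank-k approximation,
    % and the discarded singular values give the exact error
    X_k = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{\top}, \qquad
    \lVert X - X_k \rVert_F^{2}
      = \min_{\operatorname{rank}(Y) \le k} \lVert X - Y \rVert_F^{2}
      = \sum_{i > k} \sigma_i^{2}
    ```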