Arthur Conmy comments on Classifying representations of sparse autoencoders (SAEs)

Arthur Conmy 17 Nov 2023 14:31 UTC
1 point
0
Why do you think that the sentiment will not be linearly separable?
I would guess that something like multiplying residual stream states by $W_U[\text{" positive"}] - W_U[\text{" negative"}]$ (i.e. the logit difference under the Logit Lens) would be reasonable (possibly with hacks like the tuned lens).
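A minimal sketch of that projection, assuming the unembedding matrix `W_U` has shape `(d_model, d_vocab)` and ignoring the final layer norm that a real model applies before unembedding. The random arrays and the token ids `POS_ID` and `NEG_ID` are hypothetical stand-ins, not values from any actual model or tokenizer.

```python
import numpy as np

def logit_diff_direction(W_U, pos_id, neg_id):
    """Direction in residual space: W_U[:, pos] - W_U[:, neg]."""
    return W_U[:, pos_id] - W_U[:, neg_id]

def logit_diff_scores(resid, W_U, pos_id, neg_id):
    """Project residual states of shape (n, d_model) onto that direction."""
    return resid @ logit_diff_direction(W_U, pos_id, neg_id)

# Toy stand-ins (assumption: a real run would load these from a model).
rng = np.random.default_rng(0)
d_model, d_vocab = 16, 50
W_U = rng.normal(size=(d_model, d_vocab))
resid = rng.normal(size=(8, d_model))
POS_ID, NEG_ID = 3, 7  # hypothetical ids for " positive" / " negative"

scores = logit_diff_scores(resid, W_U, POS_ID, NEG_ID)

# Same quantity computed the long way: unembed, then subtract the two logits.
logits = resid @ W_U
preds = scores > 0  # True where " positive" gets the higher logit
```

Because the projection is linear, `scores` equals `logits[:, POS_ID] - logits[:, NEG_ID]`, so thresholding it at zero is a fixed (untrained) linear classifier on the residual stream.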
- Annah 17 Nov 2023 19:25 UTC
  1 point
  0
  I’m not quite sure what you mean by “the sentiment will not be linearly separable”.
  The hidden states are linearly separable (to some extent), but the sparse representations perform worse than the original representations in my experiment.
  I am training logistic regression classifiers on the original and the sparse representations respectively, so I am multiplying the residual stream states (and their sparse encodings) by learned weights. These weights could (but don’t have to) align with some meaningful direction like hidden_states(“positive”) - hidden_states(“negative”).
  I’m not sure if I understood your comment about the logit lens. Are you proposing this as an alternative way of testing for linear separability? But then shouldn’t the information already be encoded in the hidden states and thus extractable with a classifier?
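The probing setup described in the reply above can be sketched roughly as follows. This is an illustration under loud assumptions, not the experiment itself: the "hidden states" are synthetic Gaussian clusters, the "sparse encoder" is an untrained random ReLU map standing in for an SAE, and the logistic regression is a plain gradient-descent implementation. The names `fit_logreg`, `accuracy`, `W_enc`, and `b_enc` are hypothetical, and the resulting numbers say nothing about real SAEs.

```python
import numpy as np

def fit_logreg(X, y, lr=0.1, steps=300):
    """Logistic regression by full-batch gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        z = np.clip(X @ w + b, -30, 30)  # clip for numerical stability
        p = 1.0 / (1.0 + np.exp(-z))
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def accuracy(X, y, w, b):
    """Fraction of rows whose predicted label matches y."""
    return float(np.mean(((X @ w + b) > 0) == (y == 1)))

rng = np.random.default_rng(0)
n, d = 400, 8
y = rng.integers(0, 2, size=n)  # synthetic sentiment labels (0 or 1)
direction = np.ones(d) / np.sqrt(d)
# Synthetic "hidden states": two clusters separated along one direction.
hidden = rng.normal(size=(n, d)) + 3.0 * np.outer(2 * y - 1, direction)

# Toy stand-in for an SAE encoder: random weights, ReLU, negative bias.
W_enc, b_enc = rng.normal(size=(d, 4 * d)), -1.0
sparse = np.maximum(hidden @ W_enc + b_enc, 0.0)

train, test = slice(0, 300), slice(300, n)
w_d, b_d = fit_logreg(hidden[train], y[train])
w_s, b_s = fit_logreg(sparse[train], y[train])
acc_dense = accuracy(hidden[test], y[test], w_d, b_d)
acc_sparse = accuracy(sparse[test], y[test], w_s, b_s)
print(f"dense probe: {acc_dense:.2f}, sparse probe: {acc_sparse:.2f}")
```

Comparing the learned `w_d` with a fixed direction (for example by cosine similarity) is one way to check whether the probe weights align with something interpretable, as opposed to merely separating the classes.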