Why do you think that the sentiment will not be linearly separable?
I would guess that something like multiplying residual stream states by W_U[" positive"] − W_U[" negative"] (i.e. the logit difference under the Logit Lens) would be reasonable (possibly with hacks like the tuned lens)
I’m not quite sure what you mean by “the sentiment will not be linearly separable”.
The hidden states are linearly separable (to some extent), but the sparse representations perform worse than the original representations in my experiment.
I am training logistic regression classifiers on the original and the sparse representations, respectively, so I am multiplying the residual stream states (and their sparse encodings) by learned weights. These weights could (but don’t have to) align with some meaningful direction like hidden_states(“positive”)-hidden_states(“negative”).
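To make the setup concrete, here is a minimal sketch of that kind of probe comparison. Everything in it is an assumption for illustration: the "hidden states" are synthetic Gaussian data with a planted sentiment direction, the sparse encoder is a hypothetical random dictionary with ReLU and top-k (not a trained sparse autoencoder), and the logistic regression is a plain gradient-descent fit rather than whatever library I actually used.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_logreg(X, y, lr=0.1, steps=500):
    """Fit a logistic regression probe with plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(y = 1)
        grad = p - y                             # gradient of the log loss w.r.t. the logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def accuracy(X, y, w, b):
    """Fraction of examples on the correct side of the hyperplane."""
    return float((((X @ w + b) > 0).astype(int) == y).mean())

# Synthetic stand-in for residual stream states (assumption, not real model data).
n, d_model, d_dict, k = 400, 16, 64, 4
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)           # planted "sentiment" direction
y = rng.integers(0, 2, size=n)
hidden = rng.normal(size=(n, d_model)) + 3.0 * (2 * y - 1)[:, None] * direction

# Hypothetical sparse encoder: random dictionary, ReLU, keep the top-k activations.
W_enc = rng.normal(size=(d_model, d_dict)) / np.sqrt(d_model)
acts = np.maximum(hidden @ W_enc, 0.0)
kth_largest = np.sort(acts, axis=1)[:, -k][:, None]
sparse = np.where(acts >= kth_largest, acts, 0.0)

# Train on the first 300 examples, evaluate on the remaining 100.
train, test = slice(0, 300), slice(300, n)
w_dense, b_dense = fit_logreg(hidden[train], y[train])
w_sparse, b_sparse = fit_logreg(sparse[train], y[train])

acc_dense = accuracy(hidden[test], y[test], w_dense, b_dense)
acc_sparse = accuracy(sparse[test], y[test], w_sparse, b_sparse)

# How well does the learned probe align with the planted direction?
cos_sim = float(w_dense @ direction / np.linalg.norm(w_dense))

print(f"dense probe accuracy:  {acc_dense:.3f}")
print(f"sparse probe accuracy: {acc_sparse:.3f}")
print(f"cosine(w_dense, planted direction): {cos_sim:.3f}")
```

The cosine similarity at the end is the check for whether the probe weights line up with a meaningful direction. With real hidden states one would compare against something like the difference of class means instead of a planted vector.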
I’m not sure if I understood your comment about the logit lens. Are you proposing this as an alternative way of testing for linear separability? But then shouldn’t the information already be encoded in the hidden states and thus extractable with a classifier?
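For reference, this is how I read the logit lens suggestion, as a small sketch with made-up numbers. The unembedding matrix, the token ids, and the residual states are all random placeholders, and the final layer norm that a real logit lens would apply is ignored here.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, vocab_size = 16, 50
W_U = rng.normal(size=(d_model, vocab_size))  # placeholder unembedding matrix
pos_id, neg_id = 7, 11                        # hypothetical ids for " positive" / " negative"

resid = rng.normal(size=(5, d_model))         # placeholder residual stream states

# Fixed probe direction: difference of the two unembedding columns.
lens_direction = W_U[:, pos_id] - W_U[:, neg_id]
logit_diff = resid @ lens_direction

# Same quantity computed the long way: unembed everything, then subtract.
logits = resid @ W_U
logit_diff_full = logits[:, pos_id] - logits[:, neg_id]

print(np.allclose(logit_diff, logit_diff_full))
```

Seen this way, the logit difference is just one particular fixed linear probe on the residual stream, which is why I would expect a trained linear classifier to be able to recover the same information if it is there.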