Thanks for sharing your findings. This was an interesting idea to test out! I played around with the notebook you linked and noticed that the logistic regression training accuracy is also pretty low for earlier layers when using the encoded hidden representations. This was initially surprising (surely it should be easy to overfit with such a high-dimensional input space and only ~1000 examples?) until I noticed that the number of ‘on’ features is pretty low, especially for early-layer SAEs.
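For anyone wanting to run the same check, here is a minimal sketch of how one might count the ‘on’ features. It assumes the SAE activations are already available as a NumPy array of shape (n_examples, n_features). The function name `active_feature_summary`, the 4096-feature width, and the toy activations are hypothetical stand-ins, not taken from the notebook.

```python
import numpy as np

def active_feature_summary(feature_acts, threshold=0.0):
    """Return (indices of features on for at least one example, mean L0).

    feature_acts: (n_examples, n_features) array of SAE activations.
    A feature counts as 'on' for an example if its activation > threshold.
    """
    on = feature_acts > threshold
    ever_on = np.flatnonzero(on.any(axis=0))   # columns that fire at least once
    mean_l0 = float(on.sum(axis=1).mean())     # average number of on features per example
    return ever_on, mean_l0

# Toy stand-in for early-layer SAE activations (hypothetical sizes and indices):
# only three features ever fire, everything else is exactly zero.
rng = np.random.default_rng(0)
n_examples, n_features = 1000, 4096
acts = np.zeros((n_examples, n_features))
acts[:, [17, 305, 1999]] = rng.uniform(0.1, 1.0, size=(n_examples, 3))

ever_on, mean_l0 = active_feature_summary(acts)
reduced = acts[:, ever_on]  # what a downstream classifier effectively sees
print(ever_on, mean_l0, reduced.shape)
```

If `ever_on` is tiny compared with `n_features`, the classifier is really being trained on `reduced`, however wide the nominal input is.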
For example, the layer 2 SAE has only 2 features that are on anywhere in the dataset (the same 2 for every example), so effectively you’re training a classifier after a dimensionality reduction down to 2 dimensions. This may be a tall order even if you used (say) PCA to choose those 2 dimensions, but in the case of the pretrained SAE those two dimensions were chosen to optimise reconstruction on the full data distribution (of which this dataset is rather unrepresentative). The upshot is that unless you’re lucky (and the SAE happened to pick features that correspond to sentiment), it makes sense that you lose a lot of classification performance.
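A toy illustration of the unlucky case, as a sketch under loud assumptions: the hidden states are synthetic Gaussians, the label is a linear function of the hidden state, and the two "SAE features" are ReLU'd directions that I deliberately make orthogonal to the label direction. The plain gradient-descent logistic regression is only a stand-in for whatever classifier the notebook uses, and none of the names or sizes come from it.

```python
import numpy as np

def train_accuracy(X, y, steps=500, lr=0.5):
    """Fit logistic regression by gradient descent and return training accuracy."""
    X = np.hstack([X, np.ones((len(X), 1))])      # append a bias column
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))        # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)          # gradient step on the mean log loss
    return float(((X @ w > 0) == (y == 1)).mean())

rng = np.random.default_rng(0)
n, d = 1000, 64                                   # hypothetical dataset and hidden sizes
hidden = rng.normal(size=(n, d))                  # synthetic "hidden states"
label_dir = rng.normal(size=d)                    # synthetic "sentiment" direction
y = (hidden @ label_dir > 0).astype(float)

# Two feature directions with the label direction projected out (the unlucky case).
dirs = rng.normal(size=(2, d))
dirs -= np.outer(dirs @ label_dir, label_dir) / (label_dir @ label_dir)
two_feats = np.maximum(hidden @ dirs.T, 0.0)      # ReLU'd 2-dimensional code

acc_dense = train_accuracy(hidden, y)
acc_two = train_accuracy(two_feats, y)
print(f"dense: {acc_dense:.3f}  two features: {acc_two:.3f}")
```

In this construction the classifier on the two features cannot do much better than chance even on the training set, while the one on the full hidden state fits the labels well.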
In contrast, the SAEs for the final layers have hundreds of features that are ‘on’ over the dataset, so even if none of those features directly relate to sentiment, the chances are good that you have preserved enough of the structure in the original hidden state to be able to recover sentiment. On the other hand, even at this end of the spectrum, note you haven’t really projected to a higher dimensional space: you’ve gone from ~1000 dimensions to a similar or smaller number of effective dimensions, so it’s not so surprising that performance still doesn’t match training a classifier on the hidden states directly.
All in all, I think this gave me a couple of useful insights:
It’s important to have really, really high fidelity with SAEs if you want to keep L0 (the number of on features) low while still being able to use the SAE for analysis on a very narrow distribution. (E.g. in this case, if the layer 2 SAE really had encoded the concept of sentiment, then it wouldn’t have mattered that only 2 features were on, on average, across the dataset.)
I originally shared your initial hypothesis (about projecting to a higher dimensional space making concepts more separable), but have updated to thinking that I shouldn’t think of sparse “high dimensional” projections in the same way as dense projections. My new mental model for sparse projections is that you’re actually projecting down to a lower dimensional space, but where the projection is task dependent (i.e. the SAE’s ReLU chooses which projections it thinks are relevant). (Think of it a bit like a mixture-of-experts dimensionality reduction algorithm.) So the act of projection will only help with classification performance if the dimensions chosen by the filter are actually relevant to the problem (which requires a really good SAE). Otherwise you’re likely to get worse performance than if you hadn’t projected at all.
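To make that mental model concrete, here is a small sketch with a toy ReLU encoder. The weights are random and the sizes, bias value, and names (`W_enc`, `b_enc`, `encode`) are hypothetical, so this is not any particular pretrained SAE. It only shows that, for a given input, the nonzero part of the sparse code is the same thing as an affine projection onto the few encoder rows the ReLU left on.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 16, 128                      # hypothetical sizes
W_enc = rng.normal(size=(n_features, d_model)) / np.sqrt(d_model)
b_enc = np.full(n_features, -1.5)                  # negative bias keeps most features off

def encode(x):
    """Toy SAE encoder: ReLU(W_enc @ x + b_enc)."""
    return np.maximum(W_enc @ x + b_enc, 0.0)

x = rng.normal(size=d_model)
f = encode(x)

# The ReLU acts as a filter: it picks which rows of W_enc apply to this input.
active = np.flatnonzero(f > 0)
inactive = np.setdiff1d(np.arange(n_features), active)

# On the active set, the "high dimensional" code is just a low dimensional
# affine projection of x using the selected rows.
low_dim = W_enc[active] @ x + b_enc[active]

print(f"{active.size} of {n_features} features on for this input")
```

Different inputs generally select different rows, which is the sense in which the filter does the choosing. Whether the selected rows carry the information a downstream classifier needs is a separate question.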
Yeah, this makes a ton of sense. Thx for taking the time to give it a closer look and also your detailed response :)
So then in order for the SAE to be useful I’d have to train it on a lot of sentiment data and then I could maybe discover some interpretable sentiment related features that could help me understand why a model thinks a review is positive/negative...