I don’t think dataset imbalance is the cause of the poor performance of auto-regressive models when unsupervised methods are applied. I believe both papers enforced a 50/50 balance when applying CCS.
So why might a supervised probe succeed where CCS fails? My best guess is that, for the datasets considered in these papers, auto-regressive models do not represent truth saliently enough after contrast pairs are constructed. Contrast pair construction does not guarantee that truth is the most salient feature difference between the positive and negative representations. For example, imagine that, for IMDB movie reviews, what the model most saliently represents is the consistency between the final completion token (‘positive’/‘negative’) and the sentiment-laden words in the review (‘good’, ‘great’, ‘bad’, ‘horrible’). Take the prompt: “Consider the following movie review: ‘This movie makes a great doorstop.’ The sentiment of this review is [positive|negative].” A ‘sentiment-consistency’ feature would treat ‘positive’ as the consistent completion (matching ‘great’), even though the review is sarcastic and its true sentiment is negative. If sufficiently salient, this feature could be picked up by CCS, but it would not align with truth.
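To make the failure mode concrete, here is a minimal sketch of the CCS objective from Burns et al. in PyTorch. The probe class and variable names (`CCSProbe`, `h_pos`, `h_neg`) are my own, and I’m assuming the usual setup where hidden states are normalized per class before training; the point is just that nothing in the loss mentions truth:

```python
import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    """Linear probe mapping a hidden state to a probability in [0, 1]."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(h))

def ccs_loss(probe: CCSProbe, h_pos: torch.Tensor, h_neg: torch.Tensor) -> torch.Tensor:
    """CCS objective: the probe should assign complementary probabilities
    to the two halves of each contrast pair (consistency) while avoiding
    the degenerate answer p = 0.5 everywhere (confidence).

    h_pos / h_neg: hidden states for the 'positive' / 'negative' completions,
    assumed already normalized per class so the probe can't simply read off
    the final token.
    """
    p_pos = probe(h_pos)  # probability assigned to the 'positive' completion
    p_neg = probe(h_neg)  # probability assigned to the 'negative' completion
    consistency = (p_pos - (1 - p_neg)) ** 2
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()
```

Any feature whose sign reliably flips between the two completions, truth and ‘sentiment-consistency’ alike, minimizes this loss equally well, which is exactly why the relative salience of the truth feature matters.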
I can’t say why this sort of situation would apply to auto-regressive models and not to other models, but it’s certainly an interesting direction for future research!