Thanks for posting this! It seems important to balance the dataset before training CCS probes.
Another strange thing is that the accuracy of CCS degrades for auto-regressive models like GPT-J and LLaMA. For GPT-J it is roughly at chance level, about 50-60%, as reported in the DLK paper (Burns et al., 2022). And in the ITI paper (Li et al., 2023) they chose a linear regression probe instead of CCS, and report that CCS performed so poorly it was near random (consistent with the DLK paper). Do you have thoughts on that? Perhaps they used imbalanced datasets of the kind your research points to?
I don’t think dataset imbalance is the cause of the poor performance of unsupervised methods on auto-regressive models. I believe both papers enforced a 50/50 balance when applying CCS.
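For concreteness, here is a minimal sketch (in Python/NumPy) of the kind of 50/50 balancing I have in mind, downsampling the majority label before fitting a probe; the `hidden_states` and `labels` arrays are hypothetical stand-ins for extracted activations and gold labels, not anything taken from either paper's code:

```python
import numpy as np

def balance_fifty_fifty(hidden_states, labels, seed=0):
    """Downsample the majority class so true/false labels are 50/50."""
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(labels == 1)
    neg_idx = np.flatnonzero(labels == 0)
    n = min(len(pos_idx), len(neg_idx))
    keep = np.concatenate([
        rng.choice(pos_idx, size=n, replace=False),
        rng.choice(neg_idx, size=n, replace=False),
    ])
    rng.shuffle(keep)
    return hidden_states[keep], labels[keep]
```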
So why might a supervised probe succeed when CCS fails? My best guess is that, for the datasets considered in these papers, auto-regressive models do not have sufficiently salient representations of truth after constructing contrast pairs. Contrast pair construction does not guarantee isolating truth as the most salient feature difference between the positive and negative representations. For example, imagine that, for IMDB movie reviews, the model most saliently represents consistency between the last completion token (‘positive’/‘negative’) and positive or negative words in the review (‘good’, ‘great’, ‘bad’, ‘horrible’). Example: “Consider the following movie review: ‘This movie makes a great doorstop.’ The sentiment of this review is [positive|negative].” This ‘sentiment-consistency’ feature could be picked up by CCS if it is sufficiently salient, but would not align with truth.
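To make the setup concrete, here is a rough PyTorch sketch of contrast-pair construction and the standard CCS objective (consistency plus confidence). The template string and the random `x_pos`/`x_neg` tensors are hypothetical placeholders for real mean-normalized hidden states, and the loss follows the formulation in the DLK paper; whichever feature most saliently separates the two sides, truth or mere ‘sentiment-consistency’, is what a probe trained this way will tend to latch onto.

```python
import torch

TEMPLATE = ("Consider the following movie review: '{review}' "
            "The sentiment of this review is {label}.")

def make_contrast_pair(review):
    """Build the 'positive' and 'negative' completions for one review."""
    return (TEMPLATE.format(review=review, label="positive"),
            TEMPLATE.format(review=review, label="negative"))

class CCSProbe(torch.nn.Module):
    """Linear probe with a sigmoid, mapping a hidden state to p(true)."""
    def __init__(self, dim):
        super().__init__()
        self.linear = torch.nn.Linear(dim, 1)

    def forward(self, x):
        return torch.sigmoid(self.linear(x))

def ccs_loss(p_pos, p_neg):
    # Consistency: the two sides of a contrast pair should sum to 1.
    consistency = (p_pos - (1 - p_neg)) ** 2
    # Confidence: penalize the degenerate "always output 0.5" solution.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    n, dim = 256, 768
    # Stand-ins for mean-normalized hidden states of the two completions.
    x_pos, x_neg = torch.randn(n, dim), torch.randn(n, dim)
    probe = CCSProbe(dim)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
    for _ in range(200):
        opt.zero_grad()
        loss = ccs_loss(probe(x_pos), probe(x_neg))
        loss.backward()
        opt.step()
```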
Why this sort of situation might apply to auto-regressive models and not other models, I can’t say, but it’s certainly an interesting area for future research!