Summary
Contrast-Consistent Search (CCS) is a method for finding truthful directions within the activation spaces of large language models (LLMs) in an unsupervised way, introduced in Burns et al., 2022. However, all experiments in that study involve training datasets that are balanced with respect to the ground-truth labels of the questions used to generate contrast pairs.[1] This allows for the possibility that CCS performance is implicitly dependent on the balance of ground-truth labels, and therefore is not truly unsupervised.
In this post, we demonstrate that the imbalance of ground-truth labels in the training dataset can prevent CCS, or any contrast-pair-based unsupervised method, from consistently finding truthful directions in an LLM’s activation space.
This post is a distillation of a more detailed write-up.
Normalization of Contrast-Pair Representations
The performance of CCS, as well as the other unsupervised methods introduced in Burns et al., 2022, is driven by representing contrast pairs such that truthfulness is isolated as the most salient feature. Here is a basic overview of the process:
We begin with a question, $q_i$:
$q_i$ = “Is a cat a mammal?”
From any question $q_i$, a contrast pair $(x_i^+, x_i^-)$ may be formed:
$x_i^+$ = “Is a cat a mammal? Yes”
$x_i^-$ = “Is a cat a mammal? No”
In this case, the ground-truth label for $q_i$ is positive, as the positive completion is the true statement among the pair.
Let $\phi(\cdot)$ be a feature extractor that maps a series of tokens, $x$, to a vector. In Burns et al., 2022, $\phi(x)$ is the mapping produced by inputting $x$ into an LLM and extracting the hidden state output by the last layer at the last token’s residual stream.
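As a concrete illustration, here is a minimal sketch of such a feature extractor using the Hugging Face transformers library; the model name and other details are our own illustrative choices, not the exact setup of Burns et al., 2022:

```python
# Minimal sketch of a feature extractor phi(x): the last layer's hidden state
# at the final token. Model choice is illustrative only.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # stand-in; Burns et al., 2022 evaluate several larger models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

def phi(text: str) -> torch.Tensor:
    """Map a token sequence to the last-layer hidden state at the last token."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    return outputs.hidden_states[-1][0, -1, :]  # shape: (hidden_dim,)

rep_pos = phi("Is a cat a mammal? Yes")
rep_neg = phi("Is a cat a mammal? No")
```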
Then, for the contrast pair $(x_i^+, x_i^-)$, we can generate an unnormalized representation pair of vectors, $(\phi(x_i^+), \phi(x_i^-))$. The unnormalized representations of this contrast pair may encode for two primary differences:
1. $x_i^+$ ends in “Yes”, while $x_i^-$ ends in “No”
2. One of $(x_i^+, x_i^-)$ is true, while the other is false
For any unsupervised method such as CCS to work, we need to isolate the feature encoded by difference 2 from that of difference 1. This is where mean normalization of the representations comes in. (Note: 1 and 2 are not the only possible differences. The problem of models representing additional “truth-like” features is discussed in this post.)
Imagine we have generated many contrast pairs from many questions, $\{(x_i^+, x_i^-)\}_{i=1}^{N}$, and obtained their unnormalized representations, $\{(\phi(x_i^+), \phi(x_i^-))\}_{i=1}^{N}$. We then take the mean of the positive and negative representations separately:

$$\mu^+ = \frac{1}{N}\sum_{i=1}^{N} \phi(x_i^+), \qquad \mu^- = \frac{1}{N}\sum_{i=1}^{N} \phi(x_i^-)$$
Finally, we remove the respective mean from the elements of all contrast pair representations to obtain our final dataset (we omit variance normalization, as it does not affect CCS performance):

$$\tilde{\phi}(x_i^+) = \phi(x_i^+) - \mu^+, \qquad \tilde{\phi}(x_i^-) = \phi(x_i^-) - \mu^-$$
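In code, this normalization step might look like the following minimal sketch (reps_pos and reps_neg are assumed to be arrays of the stacked unnormalized representations; the names are ours):

```python
# Minimal sketch of mean normalization. reps_pos and reps_neg are assumed to be
# (N, hidden_dim) arrays holding phi(x_i^+) and phi(x_i^-) for the whole dataset.
import numpy as np

def mean_normalize(reps_pos: np.ndarray, reps_neg: np.ndarray):
    """Subtract each completion group's own mean from its representations."""
    mu_pos = reps_pos.mean(axis=0)
    mu_neg = reps_neg.mean(axis=0)
    return reps_pos - mu_pos, reps_neg - mu_neg
```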
If the feature extractor represents truthfulness to some degree, and the dataset is somewhat balanced with respect to ground-truth labels, we will have obtained a dataset, $\{(\tilde{\phi}(x_i^+), \tilde{\phi}(x_i^-))\}$, hopefully having truth as the most salient feature differentiating $\tilde{\phi}(x_i^+)$ and $\tilde{\phi}(x_i^-)$ for each pair. This salience would allow unsupervised methods such as CCS to quickly find a direction of truthfulness.
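For reference, here is a minimal PyTorch sketch of the CCS probe and its consistency-plus-confidence objective as described in Burns et al., 2022; the variable names and training details are our own illustrative choices:

```python
# Minimal sketch of CCS (Burns et al., 2022): a probe p(x) = sigmoid(w.x + b)
# trained so that p(x+) and p(x-) are consistent (sum to ~1) and confident
# (not both ~0.5). The probe is fit on the mean-normalized representations.
import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(x)).squeeze(-1)

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    consistency = ((p_pos - (1.0 - p_neg)) ** 2).mean()
    confidence = (torch.min(p_pos, p_neg) ** 2).mean()
    return consistency + confidence

def train_ccs(reps_pos_norm: torch.Tensor, reps_neg_norm: torch.Tensor,
              steps: int = 1000, lr: float = 1e-3) -> CCSProbe:
    probe = CCSProbe(reps_pos_norm.shape[1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ccs_loss(probe(reps_pos_norm), probe(reps_neg_norm))
        loss.backward()
        opt.step()
    # Note: the probe's sign is undetermined; Burns et al., 2022 resolve it by
    # taking max(accuracy, 1 - accuracy) at evaluation time.
    return probe
```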
Visualizing Contrast Pairs
At this point, we will introduce a schematic visualization useful for demonstrating the arguments made later in this post. The following shows a balanced dataset of four contrast pairs in a rather contrived representation space:
Figure 1: Mean normalization applied to a dataset having a 2D representation space. We let the truth feature point along the vertical axis, while the completion pairs are given the same horizontal-axis values for simplicity. Mean normalization moves both the + symbol group and the − symbol group down, but by slightly different amounts.

In the top plot of figure 1, the circles with plus signs are the elements $\{\phi(x_i^+)\}$, while the circles with minus signs are $\{\phi(x_i^-)\}$. The bottom plot shows the same elements after mean normalization, i.e., $\{\tilde{\phi}(x_i^+)\}$ and $\{\tilde{\phi}(x_i^-)\}$ respectively. Green circles indicate that the pair’s ground-truth label is positive, while red indicates that it is negative. We assume there is some direction in the feature extractor’s representation space that encodes for truthfulness and, for simplicity, align this direction with the vertical axis. We set the horizontal-axis values for visual convenience, but, for real datasets, the distribution across non-truthfulness directions can be arbitrarily nuanced.
Figure 1 demonstrates an obvious but important consequence of mean normalization that is very relevant to the current post. Specifically, the sets of normalized vectors, $\{\tilde{\phi}(x_i^+)\}$ and $\{\tilde{\phi}(x_i^-)\}$, can never have a non-zero average component along any direction of representation space. This entails that the average component along the truthful direction for each completion group’s set of normalized representations will always be zero.
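In symbols, using the notation above:

$$\frac{1}{N}\sum_{i=1}^{N}\tilde{\phi}(x_i^+) \;=\; \frac{1}{N}\sum_{i=1}^{N}\big(\phi(x_i^+)-\mu^+\big) \;=\; \mu^+ - \mu^+ \;=\; 0,$$

and identically for the negative completions. In particular, the component of each group’s mean along the truthful direction is forced to zero, whatever it was before normalization.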
Unintended Consequences of Mean-Normalization
We will now show that the above normalization scheme has unintended consequences when the dataset is not evenly balanced with respect to ground-truth labels. We’ll start with the extreme case of complete imbalance, later working through the case of partial imbalance.
Imagine we have a completely imbalanced dataset in which all of the positive-completed statements, $\{x_i^+\}$, are true, and thus all of the negative-completed statements, $\{x_i^-\}$, are false. When we compute $\mu^+$ and $\mu^-$, we expect the difference $\mu^+ - \mu^-$ to have a positive component along truth. However, after mean normalization, this difference along truth goes to zero. This presents a practical problem. In the normalized representation space, we now have a cluster of true statements and a cluster of false statements whose respective mean components along truth sit directly on top of each other at zero. As such, we cannot hope to cluster the statements into true and false sets with unsupervised methods such as Contrastive Representation Clustering (essentially PCA on the difference vectors). We also cannot expect the pairs in these normalized sets to obey the logical consistency and confidence properties of truth on average, as required for CCS to work. Here’s a schematic visualization, similar to figure 1, but for the completely imbalanced case:
Figure 2: Mean normalization applied to a completely imbalanced dataset with all positive ground-truth labels. We see that, after normalization, both the true and false statement clusters have no average component along truth, and that one of the pairs has flipped such that the false statement is more along truth than the true statement.

When applied to the toy situation of figure 2, we would expect all unsupervised methods proposed in Burns et al., 2022 to achieve the worst possible performance of 50% accuracy, as there is no way to distinguish true and false statements in the normalized representation space.
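As an aside, here is a minimal sketch of the “Contrastive Representation Clustering (essentially PCA on the difference vectors)” idea mentioned above. This is our own simplified rendering, not the exact implementation from Burns et al., 2022, intended to make concrete why a collapsed truth direction defeats it:

```python
# Minimal sketch of Contrastive Representation Clustering: take the top
# principal component of the normalized difference vectors and classify each
# pair by the sign of its projection (the overall sign/flip is unresolved,
# which is exactly why a collapsed truth direction is fatal).
import numpy as np

def crc_predict(reps_pos_norm: np.ndarray, reps_neg_norm: np.ndarray) -> np.ndarray:
    diffs = reps_pos_norm - reps_neg_norm                  # (N, dim)
    centered = diffs - diffs.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]                                      # top principal component
    return (diffs @ direction > 0).astype(int)             # predicted label, up to a global flip
```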
From this first example, we see that the core problem with applying mean normalization to an imbalanced dataset is that it erroneously strips away the average truth component from a completion group’s representations. The only case in which it makes sense to set the average component along truth to zero is when the dataset is perfectly balanced, that is, when half of the $\{x_i^+\}$ are true and half are false. In this specific case, we actually want to enforce that the average component along truth is the same for both completion groups, and mean normalization accomplishes this. Perfect balance is the only case explored in Burns et al., 2022, so the researchers would not have observed any impaired performance when assessing the various unsupervised methods. However, to ensure a balanced dataset, we must have ground-truth labels, and therefore cannot proceed in a purely unsupervised fashion.
As we do not expect datasets, in general, to have complete imbalance, we now look at what happens in the more general case of partial imbalance of ground-truth labels. Consider the following schematic, in which only one-third of the $\{x_i^+\}$ statements are true:
Figure 3: Mean normalization applied to a partially imbalanced dataset with 33% positive ground-truth labels. We see that, even under partial imbalance, one of the pairs has flipped such that the false statement is more along truth than the true statement.

Notice that the second pair from the left flips such that the false statement is now more along truth than the true statement. Even if CCS correctly identifies the vertical axis as the truthful direction, upon evaluation, the accuracy will be reduced to 66.7%.
In the general case of partial imbalance, the underlying problem is still the same. Each completion group’s cluster of unnormalized representations, $\{\phi(x_i^+)\}$ and $\{\phi(x_i^-)\}$, is expected to have an average component along truth that is opposite in sign between the two groups and whose magnitude depends on the degree of imbalance (a larger magnitude is expected with more imbalance). The difference between the respective average components should be zero only in the case of 50⁄50 balance, as mentioned above. However, in the partially imbalanced case, applying mean normalization zeros out this separation along truth, thereby causing the normalized representations of the true and false statements to overlap more along the direction of truth than they would have in the unnormalized representation space.
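To make this concrete, here is a toy calculation of our own (not from the original write-up), under the simplifying assumption that the truth feature contributes $+t$ to every true statement’s representation and $-t$ to every false one’s, and that a fraction $p$ of the questions have positive ground-truth labels. The expected truth components of the two completion groups are

$$\mathbb{E}\big[\phi(x_i^+)\big]_{\text{truth}} = p\,t + (1-p)(-t) = (2p-1)\,t, \qquad \mathbb{E}\big[\phi(x_i^-)\big]_{\text{truth}} = (1-2p)\,t.$$

Before normalization, the true and false statements have mean truth components of $+t$ and $-t$, a separation of $2t$ regardless of $p$. After subtracting each completion group’s mean, a short calculation gives an expected separation of $8p(1-p)\,t$ between the true and false clusters, which equals $2t$ only at $p = 1/2$ and shrinks to zero as the imbalance becomes complete ($p \to 0$ or $p \to 1$).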
As such, the only difference between complete imbalance and partial imbalance is that the expected amount of induced overlap shrinks as the imbalance becomes less and less severe. The underlying mechanism is the same in both cases, and the salience of truth as a differentiating feature is diminished as a function of imbalance. These arguments say nothing about what degree of imbalance, and therefore overlap, is enough to significantly degrade the performance of methods like CCS when applied to real datasets. This question is addressed to some extent in the next section.
Empirical Results
We’ve hopefully provided a convincing argument for how mean normalization can negatively impact the performance of CCS and other contrast-pair-based unsupervised methods when applied to imbalanced datasets. However, the question remains as to whether this is a problem in practice. To assess this, we look at a subset of the models and datasets treated in Burns et al., 2022. The only difference is that we under-sample either the positive or negative ground-truth label data points in order to achieve a desired degree of imbalance. Please see our write-up for more details on our approach.
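As an illustration, here is a minimal sketch of the kind of under-sampling step we mean (the function and variable names are ours, not our exact experiment code):

```python
# Minimal sketch of under-sampling to a target ground-truth imbalance.
# pos_idx / neg_idx index contrast pairs with positive / negative ground-truth
# labels; frac_positive is the desired fraction of positive labels (0 < frac < 1).
import numpy as np

def undersample(pos_idx: np.ndarray, neg_idx: np.ndarray,
                frac_positive: float, rng: np.random.Generator) -> np.ndarray:
    n_pos, n_neg = len(pos_idx), len(neg_idx)
    # Try keeping all negatives and shrinking the positives to hit the ratio;
    # if that needs more positives than we have, shrink the negatives instead.
    target_pos = int(round(n_neg * frac_positive / (1.0 - frac_positive)))
    if target_pos <= n_pos:
        pos_idx = rng.choice(pos_idx, size=target_pos, replace=False)
    else:
        target_neg = int(round(n_pos * (1.0 - frac_positive) / frac_positive))
        neg_idx = rng.choice(neg_idx, size=target_neg, replace=False)
    return np.concatenate([pos_idx, neg_idx])
```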
Imbalance-induced performance degradation is complicated, depending on the performance metric being examined, the degree of imbalance, and the nature of the contrast-pair dataset. For some datasets, it is surprising how robust CCS performance is with respect to imbalance. However, the magnitude of the negative effects is consistently large enough that this problem could seriously hamper the application of CCS and other unsupervised methods.
Below is a plot of CCS performance versus ground-truth label imbalance for the IMDB dataset, which was one of the datasets used in the original paper. We discuss in the write-up the possible mechanisms for this observed reduction in performance as imbalance becomes more severe.
Figure 4: Effect of ground-truth label imbalance on CCS performance for the IMDB dataset studied in Burns et al., 2022. Performance degradation in both the AUC ROC and accuracy becomes considerable starting at a 20⁄80 balance.

We can see in figure 4 that both the accuracy and the AUC ROC decrease as imbalance becomes more severe, aligning with the arguments introduced above. The degree of performance degradation differs across datasets. For example, in figure 5 below, CCS performance versus ground-truth label imbalance is plotted for the RTE dataset. Compared to the IMDB dataset, performance degradation versus imbalance is somewhat less pronounced.
Figure 5: Effect of ground-truth label imbalance on CCS performance for the RTE dataset studied in Burns et al., 2022. Performance degradation versus imbalance is less pronounced than with the IMDB dataset metrics shown in figure 4, but still becomes considerable starting at a 10⁄90 balance.

Relevance to Alignment
One can imagine training datasets with arbitrarily severe ground-truth label imbalance, such as questions pertaining to anomaly detection (e.g., a dataset formed from the prompt template “Is this plan catastrophic to humanity? {{gpt_n_proposed_plan}} Yes or no?”, to which the ground-truth label is hopefully “no” the vast majority of the time). We have shown that CCS can perform poorly on a heavily imbalanced dataset, and it therefore should not be trusted in fully unsupervised applications without further improvements to the method.
Note: Our original goal was to replicate Burns et al. (2022), and, during this process, we noticed the implicit assumption around balanced ground-truth labels. We’re new to technical alignment research, and although we believe that performance degradation caused by imbalance could be an important consideration for future alignment applications of CCS (or similar unsupervised methods), we lack the necessary experience to fully justify this belief.
[1] We are not referring to the overall balance of true and false statements, which is the binary target that the CCS concept probe is attempting to predict. In fact, CCS always enforces an equal balance of true and false statements. Rather, we refer to the ratio of positive to negative ground-truth labels of the contrast pair questions used to generate statement pairs. Equivalently, this is the ratio of true to false statements within the set of positive completions.
Thanks for posting this! It seems important to balance the dataset before training CCS probes.
Another strange thing is that the accuracy of CCS degrades for auto-regressive models like GPT-J and LLaMA. For GPT-J it is about random-chance performance as per the DLK paper (Burns et al., 2022), about 50-60%. And in the ITI paper (Li et al., 2023) they chose a linear regression probe instead of CCS, and say that CCS was so poor that it was near random (same as in the DLK paper). Do you have thoughts on that? Perhaps they used bad datasets, as per your research?
I don’t think dataset imbalance is the cause of the poor performance for auto-regressive models when unsupervised methods are applied. I believe both papers enforced a 50⁄50 balance when applying CCS.
So why might a supervised probe succeed when CCS fails? My best guess is that, for the datasets considered in these papers, auto-regressive models do not have sufficiently salient representations of truth after constructing contrast pairs. Contrast pair construction does not guarantee isolating truth as the most salient feature difference between the positive and negative representations. For example, imagine that, for IMDB movie reviews, the model most saliently represents consistency between the last completion token (‘positive’/‘negative’) and positive or negative words in the review (‘good’, ‘great’, ‘bad’, ‘horrible’). Example: “Consider the following movie review: ‘This movie makes a great doorstop.’ The sentiment of this review is [positive|negative].” This ‘sentiment-consistency’ feature could be picked up by CCS if it is sufficiently salient, but would not align with truth.
Why this sort of situation might apply to auto-regressive models and not other models, I can’t say, but it’s certainly an interesting area of future research!