This is really nice work. I was excited to see that ClassMeans did so well, especially on removal—comparably to LEACE?! I would have expected LEACE to do significantly better.
Logistic regression performs significantly worse than other methods in later layers.
Logistic direction steering also underwhelmed compared to (basically) ClassMeans in the inference-time intervention “add the truth vector” paper. I wonder why ClassMeans tends to do better?
Note that we use a disjoint training set from the one used to find the concept vector in order to learn the classifier. This ensures the removal doesn’t just obfuscate information, making it hard for a classifier to learn, but actually scrubs it.
I might be missing something basic—can you explain this last part in a bit more detail? Why would using the same training set lead to “obfuscation” (and what does that mean?)
We check the coherence and sentiment of the generated completions in a quantitative manor.

Typo: “manor” → “manner”
Thank you, I am glad you liked our work!

We think logistic regression might be homing in on spurious correlations that help with classification on that particular distribution but don’t affect the model’s later layers, and thus its outputs. ClassMeans does as well as LEACE on removal because it comes with the same linear guardedness guarantee, as noted in the LEACE paper.
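For concreteness, here is a minimal numpy sketch of the kind of mean-difference erasure we have in mind; the helper names are ours, and the actual ClassMeans procedure in the post may differ in its details. Projecting out the difference of class means makes the class-conditional means of the edited activations coincide, which is the condition the LEACE paper ties to linear guardedness:

```python
import numpy as np

def class_means_direction(X, y):
    """Unit vector along the difference of class-conditional means.

    X: (n, d) array of activations, y: (n,) array of binary labels.
    """
    v = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    return v / np.linalg.norm(v)

def remove_direction(X, v):
    """Orthogonally project each activation onto the complement of v.

    The class-mean difference lies entirely along v, so after this
    projection the two class-conditional means coincide.
    """
    return X - np.outer(X @ v, v)
```

LEACE differs in that it whitens the representation and picks the erasure map that distorts the activations as little as possible, but the guardedness condition itself only requires the projected class means to match.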
As for using a disjoint training set to train the post-removal classifier: we found that the linear classifier attained chance accuracy when trained on the dataset used for removal, but higher accuracy when trained on a disjoint set from the same distribution. One might think of this as the removal procedure ‘overfitting’ to its training data. We call it ‘obfuscation’ in the post in the sense that it’s hard to learn a classifier from the original training data, yet some information about the concept remains in the model and can be extracted with different training data. We therefore believe the most rigorous approach is to train the post-removal classifier on a separate training set.
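To make the protocol concrete, here is a sketch of that evaluation on toy data, reusing the two helpers from the sketch above (a clean synthetic setup like this won’t necessarily reproduce the overfitting effect; it only illustrates the split structure):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_split(n=2000, d=64):
    """Toy 'activations' with a binary concept along coordinate 0."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, d))
    X[:, 0] += 2.0 * y
    return X, y

X_a, y_a = make_split()   # split A: used to fit the removal
X_b, y_b = make_split()   # split B: disjoint, used to train the probe
X_t, y_t = make_split()   # held-out test data for both probes

v = class_means_direction(X_a, y_a)

# Probe trained on the same data the removal was fit on...
same = LogisticRegression(max_iter=1000).fit(remove_direction(X_a, v), y_a)
# ...versus a probe trained on disjoint data after the same removal.
disjoint = LogisticRegression(max_iter=1000).fit(remove_direction(X_b, v), y_b)

print("same-set probe:    ", same.score(remove_direction(X_t, v), y_t))
print("disjoint-set probe:", disjoint.score(remove_direction(X_t, v), y_t))
```

If the disjoint-set probe stays above chance while the same-set probe sits at chance, the removal only obfuscated the concept for its own training data rather than scrubbing it.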
Thanks for the feedback. Fixed the typo and added the ITI reference.