shash42 comments on Evaluating hidden directions on the utility dataset: classification, steering and removal

shash42 25 Sep 2023 21:39 UTC
Thank you, I am glad you liked our work!

We think logistic regression might be homing in on spurious correlations that help with classification in that particular distribution but have no effect on later layers of the model, and thus none on its outputs. ClassMeans does as well as LEACE for removal because, as the LEACE paper mentions, it has the same linear guardedness guarantee.
As for using a disjoint training set to train the post-removal classifier: we found that the linear classifier attained chance-level accuracy if trained on the dataset used for removal, but higher accuracy when trained on a disjoint training set from the same distribution. One might think of this as the removal procedure ‘overfitting’ to its training data. We refer to it as ‘obfuscation’ in the post, in the sense that it’s hard to learn a classifier from the original training data, but there is still some information about the concept in the model that can be extracted with different training data. Thus, we believe the most rigorous approach is to train the post-removal classifier on a separate training set.
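
To make the ‘obfuscation’ point concrete, here is a minimal toy sketch, not the code from the post. It assumes synthetic Gaussian data rather than the utility dataset, uses a class-means projection for removal, and uses a difference-of-means probe as a stand-in for the logistic regression classifier. All function and variable names (`sample`, `mean_gap`, `project_out`, `fit_probe`) are hypothetical.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def sample(n, dim, shift, rng):
    # Toy data (an assumption): unit Gaussian noise, with the label
    # shifting coordinate 0 by +shift (class 1) or -shift (class 0).
    xs, ys = [], []
    for i in range(n):
        y = i % 2  # balanced classes
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += shift if y == 1 else -shift
        xs.append(x)
        ys.append(y)
    return xs, ys

def mean_gap(xs, ys):
    # Difference between the two class means, mu_1 - mu_0.
    mu1 = mean([x for x, y in zip(xs, ys) if y == 1])
    mu0 = mean([x for x, y in zip(xs, ys) if y == 0])
    return [a - b for a, b in zip(mu1, mu0)]

def project_out(xs, d):
    # Remove the component of every point along direction d.
    dd = dot(d, d)
    out = []
    for x in xs:
        c = dot(x, d) / dd
        out.append([xi - c * di for xi, di in zip(x, d)])
    return out

def fit_probe(xs, ys):
    # Difference-of-means linear probe (stand-in for logistic regression).
    # Classes are balanced, so the overall mean is the midpoint of the class means.
    w = mean_gap(xs, ys)
    return w, -dot(w, mean(xs))

def accuracy(probe, xs, ys):
    w, b = probe
    hits = sum((dot(w, x) + b > 0) == (y == 1) for x, y in zip(xs, ys))
    return hits / len(xs)

def norm(v):
    return dot(v, v) ** 0.5

rng = random.Random(0)
dim, n = 50, 40
removal_x, removal_y = sample(n, dim, 1.0, rng)    # used to fit the removal
disjoint_x, disjoint_y = sample(n, dim, 1.0, rng)  # same distribution, unseen by removal
test_x, test_y = sample(2000, dim, 1.0, rng)

# Fit the removal direction on the removal set only, then apply it everywhere.
direction = mean_gap(removal_x, removal_y)
removal_x, disjoint_x, test_x = (
    project_out(s, direction) for s in (removal_x, disjoint_x, test_x)
)

# On the removal set the class means now coincide (up to rounding), so the
# probe fit there has a near-zero weight vector and is uninformative.
gap_removal = norm(mean_gap(removal_x, removal_y))
# On the disjoint set the estimated direction was imperfect, so a gap remains.
gap_disjoint = norm(mean_gap(disjoint_x, disjoint_y))

acc_removal = accuracy(fit_probe(removal_x, removal_y), test_x, test_y)
acc_disjoint = accuracy(fit_probe(disjoint_x, disjoint_y), test_x, test_y)
print("class-mean gap after removal:", gap_removal, "(removal set)", gap_disjoint, "(disjoint set)")
print("test accuracy of probe trained on:", acc_removal, "(removal set)", acc_disjoint, "(disjoint set)")
```

The removal direction is estimated from a finite sample, so it only exactly equalises the class means on that sample. A probe trained on the disjoint set can pick up whatever class signal the imperfect direction left behind, which is why we evaluate removal with a separate training set.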