I agree that using a linear classifier to find a concept can go terribly wrong, and I think that this post shows that it does go wrong. But I think that how it goes wrong can be informative (I hope I did not fail too badly to apply the second law of experiment design!).
Here, the classifier is able to classify labels almost perfectly, so it’s not learning only about outliers. But what is measured is a correlation between activations and labels, not a causal explanation of how the model uses those activations, so it doesn’t mean that the classifier found “a concept of gender” the model actually uses. And indeed, if you completely remove the direction found by the classifier, the model is still able to “use” the concept of gender: its behavior is almost unchanged.
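To make that distinction concrete, here is a minimal sketch of the probe-and-ablate setup, using scikit-learn and synthetic stand-ins (`acts`, `labels`) for the actual activations and gender labels; the causal test then requires re-running the model on the projected activations, which is not shown here.

```python
# Minimal sketch: fit a linear probe on activations, then project out the
# direction it finds. `acts` and `labels` are synthetic stand-ins for the
# real activations and gender labels from the post.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d_model = 2000, 768
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d_model))
# Inject one strongly label-correlated direction so the probe can reach
# near-perfect accuracy, as in the post.
acts += 4.0 * np.outer(labels - 0.5, rng.normal(size=d_model) / np.sqrt(d_model))

# 1. The probe only measures a correlation between activations and labels.
probe = LogisticRegression(max_iter=1000).fit(acts, labels)
print("probe accuracy:", probe.score(acts, labels))

# 2. "Removing the direction" = projecting activations onto the orthogonal
#    complement of the probe's weight vector.
w = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
acts_ablated = acts - np.outer(acts @ w, w)

# The causal question (does the model's behavior change?) can only be
# answered by running the model on acts_ablated; near-unchanged behavior
# is what is reported here for this single-direction ablation.
```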
RLACE has a much more significant impact on model behavior, and that impact does seem somewhat related to gender, but I wouldn’t bet it has found “the concept of gender”, for the same reasons as above.
Still, I think that all of this is not completely useless for understanding what’s happening in the network (for example, the network is using large features, not “crisp” ones), and it is mildly informative for future experiment design.
Here, the classifier is able to classify labels almost perfectly, so it’s not learning only about outliers.
If there’s one near-perfect separating hyperplane, there are usually lots of near-perfect separating hyperplanes; which one the linear classifier picks is determined mostly by outliers/borderline cases. That’s what I mean when I say it’s mostly measuring outliers.
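A toy illustration of this (synthetic data, not the post’s activations): fit the classifier, then drop a few of the lowest-margin points and refit; accuracy stays near-perfect in both cases, but the learned direction can shift.

```python
# Toy illustration: with (nearly) separable classes, many hyperplanes reach
# near-perfect accuracy, and the one picked depends heavily on the few
# borderline points. All data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 1000, 50
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d)) + 5.0 * np.outer(y - 0.5, np.eye(d)[0])  # separated along axis 0

def fit_direction(X, y):
    clf = LogisticRegression(max_iter=2000).fit(X, y)
    w = clf.coef_[0]
    return w / np.linalg.norm(w), clf.score(X, y)

w_all, acc_all = fit_direction(X, y)

# Drop the 20 points closest to the decision boundary and refit.
margins = np.abs(X @ w_all)
keep = np.argsort(margins)[20:]
w_trim, acc_trim = fit_direction(X[keep], y[keep])

print("accuracies:", acc_all, acc_trim)                    # both near-perfect
print("cosine similarity of directions:", w_all @ w_trim)  # typically < 1
```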