A natural thing to do is to have a bunch of sentences about guys and girls, train a linear classifier to predict if the sentence is about guys or girls, and use the direction the linear classifier gives you as the “direction corresponding to the concept of gender”.
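For concreteness, a minimal sketch of that recipe, assuming activations have already been collected for the labeled sentences; the file names and variable names below are placeholders, not anything from the original setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# activations: (n_sentences, d_model) activations collected at some layer
# labels: (n_sentences,) with 0 = "about guys", 1 = "about girls"
activations = np.load("activations.npy")  # hypothetical file names
labels = np.load("labels.npy")

clf = LogisticRegression(max_iter=1000).fit(activations, labels)

# The classifier's weight vector is taken as the candidate "gender direction",
# normalized to unit length.
gender_direction = clf.coef_[0] / np.linalg.norm(clf.coef_[0])
```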
My knee-jerk response to this is You Are Not Measuring What You Think You are Measuring; I would not expect this to find a robust representation of the “concept of gender” in the network, even in situations where the network does have a clearly delineated internal representation of the concept of gender. There’s all sorts of ways a linear classifier might fail to find the intended concept (even assuming that gender is represented by a direction in activation-space at all). My modal guess would be that your linear classifier mostly measured outliers/borderline labels.
I agree that using a linear classifier to find a concept can go terribly wrong, and I think that this post shows that it does go wrong. But I think that how it goes wrong can be informative (I hope I did not fail too badly to apply the second law of experiment design!).
Here, the classifier is able to classify labels almost perfectly, so it’s not learning only about outliers. But what is measured is a correlation between activations and labels, not a causal explanation of how the model uses its activations, so it doesn’t mean that the classifier found “a concept of gender” the model actually uses. And indeed, if you completely remove the direction found by the classifier, the model is still able to “use” the concept of gender: its behavior is almost unchanged.
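For concreteness, a minimal sketch of what “completely removing the direction” amounts to, assuming the direction comes from a classifier as above; how the projected activations are patched back into the model’s forward pass (e.g. with hooks at a given layer) depends on the setup and is not shown:

```python
import numpy as np

def remove_direction(activations: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project activations onto the hyperplane orthogonal to `direction` (rank-1 ablation)."""
    d = direction / np.linalg.norm(direction)
    return activations - np.outer(activations @ d, d)
```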
RLACE has a much more significant impact on model behavior, and that impact seems somewhat related to gender, but I wouldn’t bet it has found “the concept of gender”, for the same reasons as above.
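For readers unfamiliar with it: RLACE (Relaxed Linear Adversarial Concept Erasure, Ravfogel et al. 2022) removes a low-rank subspace rather than a single direction. The sketch below only illustrates applying such a projection once you have it; `U` is a stand-in for an orthonormal basis of whatever subspace the solver returns, not the actual RLACE API:

```python
import numpy as np

def apply_subspace_erasure(activations: np.ndarray, U: np.ndarray) -> np.ndarray:
    """Remove the k-dimensional subspace spanned by the orthonormal columns of U (shape d x k)."""
    P = np.eye(U.shape[0]) - U @ U.T  # orthogonal projection onto the complement of span(U)
    return activations @ P            # P is symmetric, so no transpose needed
```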
Still, I think that all of this is not completely useless for understanding what’s happening in the network (for example, the network is using large features, not “crisp” ones), and is mildly informative for future experiment design.
Here, the classifier is able to classify labels almost perfectly, so it’s not learning only about outliers.
If there’s one near-perfect separating hyperplane, then there’s usually lots of near-perfect separating hyperplanes; which one is chosen by the linear classifier is determined mostly by outliers/borderline cases. That’s what I mean when I say it’s mostly measuring outliers.
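A toy illustration of this point on synthetic data (not the activations from the post): when there are more dimensions than points, many hyperplanes separate the training set, and which one a linear classifier picks shifts with the subsample it sees:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
d, n = 512, 300                      # more dimensions than points: the data is separable
X = rng.normal(size=(n, d))
y = (X[:, 0] > 0).astype(int)        # the "true" concept lives entirely along axis 0

directions = []
for _ in range(5):
    idx = rng.choice(n, size=200, replace=False)          # a different subsample each time
    clf = LinearSVC(C=100.0, max_iter=50000).fit(X[idx], y[idx])
    print("train accuracy:", clf.score(X[idx], y[idx]))   # near-perfect every time
    w = clf.coef_[0]
    directions.append(w / np.linalg.norm(w))

# All five classifiers separate their sample (near-)perfectly, yet the directions
# they pick are not identical: the pairwise cosine similarities sit noticeably
# below 1, and the weight on the true axis (coordinate 0) varies across fits.
print(np.round([[float(u @ v) for v in directions] for u in directions], 2))
print(np.round([float(u[0]) for u in directions], 2))
```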