Overall, there doesn’t seem to be any clear trend in what I’ve tried. Maybe it would be clearer if I had larger benchmarks. I’m currently working on finding a good large one; tell me if you have any ideas.
The logit lens direction (she-he) seems to work slightly better on average in smaller models. Larger models can exhibit transitions between regions where the causal direction changes radically.
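For reference, here is a minimal sketch of how the logit lens (she-he) direction can be obtained, assuming a GPT-2-style Hugging Face model; the model name and token strings are illustrative, not the exact setup used in these experiments:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM with an unembedding matrix works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Rows of the unembedding matrix for the tokens " she" and " he"
# (leading space because of GPT-2's BPE vocabulary).
W_U = model.get_output_embeddings().weight  # (vocab_size, d_model)
she_id = tokenizer.encode(" she")[0]
he_id = tokenizer.encode(" he")[0]

# The logit lens direction: moving residual-stream activations along it
# trades the "she" logit off against the "he" logit.
direction = W_U[she_id] - W_U[he_id]
direction = direction / direction.norm()
```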
I’m surprised that even small models generalize as well as larger ones on French.
All experiments are on gender. Layer numbers are given as a fraction of the total number of layers. “mean diff” is the direction corresponding to the difference of means between positive and negative labels, which in practice is pretty close to RLACE while being extremely cheap to compute (see the sketch below).
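A minimal sketch of the “mean diff” computation, assuming activations have already been collected at some layer; the shapes and argument names are assumptions for illustration, not the exact pipeline used here:

```python
import torch

def mean_diff_direction(acts: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """acts: (n_samples, d_model) residual-stream activations at one layer;
    labels: (n_samples,) booleans, True for positively labeled examples."""
    pos_mean = acts[labels].mean(dim=0)   # mean activation over positive examples
    neg_mean = acts[~labels].mean(dim=0)  # mean activation over negative examples
    direction = pos_mean - neg_mean
    return direction / direction.norm()   # unit vector along the difference of means
```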