Hi! I recently trained a suite of models ranging from 19M to 13B parameters with the goal of promoting research on LLM interpretability. I think it would be awesome to try out these experiments on the model suite and look at how the results change as the models scale. If your code uses the HF transformers library, it should work more or less out of the box with my new model suite.
You can find out more here: https://twitter.com/AiEleuther/status/1603755161893085184?s=20&t=6xkBsYckPcNZEYG8cDD6Ag
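For example, a minimal sketch of loading one of the models with transformers (the checkpoint id below is just a placeholder, substitute whichever size from the suite you want to probe):

```python
# Minimal sketch: loading one model from the suite with HF transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "EleutherAI/pythia-70m"  # placeholder id, pick any size from the suite
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, output_hidden_states=True)

inputs = tokenizer("My aunt said that", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states: tuple of (n_layers + 1) tensors, one per layer plus the embeddings
print(len(outputs.hidden_states), outputs.hidden_states[-1].shape)
```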
I launched some experiments. I’ll keep you updated.
Overall, there doesn’t seem to be any clear trend in what I’ve tried. Maybe it would be clearer if I had larger benchmarks. I’m currently working on finding a good large one; let me know if you have any ideas.
The logit lens direction (she-he) seems to work on average slightly better in smaller models. Larger models can exhibit transitions between regions where the causal direction changes radically.
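For reference, a minimal sketch of one way to get such a she-he logit lens direction, assuming it is just the normalized difference of the unembedding rows for the " she" and " he" tokens (the checkpoint name is again a placeholder):

```python
# Sketch: (she - he) direction from the unembedding matrix of an HF causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "EleutherAI/pythia-70m"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

she_id = tokenizer(" she", add_special_tokens=False).input_ids[0]
he_id = tokenizer(" he", add_special_tokens=False).input_ids[0]

# Unembedding matrix: (vocab_size, d_model)
W_U = model.get_output_embeddings().weight.detach()
direction = W_U[she_id] - W_U[he_id]
direction = direction / direction.norm()
```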
I’m surprised that even small models generalize as well as larger ones on French.
All experiments are on gender. Layer numbers are given as a fraction of the total number of layers. “mean diff” is the direction corresponding to the difference of means between positive and negative labels, which in practice is pretty close to RLACE while being extremely cheap to compute.
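Concretely, a minimal sketch of the mean diff direction (the activations and labels below are toy stand-ins for real residual-stream activations at a given layer and their gender labels):

```python
import torch

def mean_diff_direction(activations: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Normalized difference of means between positive- and negative-labeled activations.

    activations: (n_samples, d_model) activations at one layer
    labels: (n_samples,) boolean mask, True for the positive class
    """
    pos_mean = activations[labels].mean(dim=0)
    neg_mean = activations[~labels].mean(dim=0)
    direction = pos_mean - neg_mean
    return direction / direction.norm()

# Toy usage with random data standing in for real activations
acts = torch.randn(256, 512)
labs = torch.rand(256) > 0.5
d = mean_diff_direction(acts, labs)
```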