Thank you for sharing! I’m also working on a write-up of experiments I ran with SAEs trained on Othello-GPT :) I’m using the original model by Kenneth Li et al., and mostly training SAEs with 512-1024 features. I also found that simple features such as my/their/empty are rarely found in SAEs trained on later layers. However, there are more of them in SAEs trained on middle layers (including for cells outside the “inner ring”). In later layers, SAEs usually learn more complicated features, such as a combination of a few nearby cells each being of a particular type. That makes sense: by the last layer, all the model cares about is the complex feature “is this move valid”, so simple features gradually become less relevant (and will probably have a smaller norm). I’m hoping to share more of my findings soon.
[Continuing our conversation from messages]
I just finished a training run of SAEs on the intermediate layers of my OthelloGPT. For me the sweet spot seemed to be layers 2-3, and the SAE found up to 30 high-accuracy classifiers on Layer 3. They were all located in the “inner ring” and “outer ring”, with only one in the “middle ring”. (As before, I’m counting “high-accuracy” as AUROC > 0.9, which is an imperfect metric and threshold.)
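For anyone who wants to reproduce the “high-accuracy” check, here is a minimal sketch of how one might score a single SAE feature as a classifier for one board-state label. It is only an illustration under my own assumptions, not the code used for the run above: `auroc`, `is_high_accuracy`, and the toy activations and labels are hypothetical names and made-up data, and the AUROC is computed with the standard rank-based (Mann-Whitney) formula in plain Python.

```python
def auroc(scores, labels):
    """Probability that a random positive example gets a higher score than a
    random negative one, with ties counted as 0.5 (rank-based AUROC)."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    n_pos = sum(1 for _, y in pairs if y)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative label")

    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the run of tied scores starting at index i.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        # Tied scores occupy 1-based ranks i+1 .. j; each gets the average rank.
        avg_rank = (i + 1 + j) / 2.0
        rank_sum_pos += avg_rank * sum(1 for k in range(i, j) if pairs[k][1])
        i = j

    # Mann-Whitney U statistic for the positive class, normalised to [0, 1].
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


def is_high_accuracy(scores, labels, threshold=0.9):
    """Apply the (admittedly imperfect) AUROC > 0.9 cutoff from the comment above."""
    return auroc(scores, labels) > threshold


# Toy data (made up): activations of one SAE feature over 8 board positions,
# and whether a given cell has the target class (e.g. "empty") in each one.
feature_activations = [0.0, 0.0, 0.1, 1.2, 0.9, 0.0, 1.5, 0.05]
cell_has_class = [0, 0, 0, 1, 1, 0, 1, 1]
print(auroc(feature_activations, cell_has_class))
print(is_high_accuracy(feature_activations, cell_has_class))
```

Using the raw activation as the score avoids having to pick an activation threshold per feature, which is one reason AUROC is convenient here even though the 0.9 cutoff itself is arbitrary.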
Here are the full results. The numbers/colors indicate, for each board position, how many classes had a high-accuracy classifier.
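To make the per-position numbers concrete, here is a rough sketch of how such a count grid could be assembled. It assumes you already have, for every (row, column, class) triple, the best AUROC achieved by any SAE feature; the function name, the dictionary layout, and the class names `"mine"`, `"theirs"`, `"empty"` are my own placeholders, and the scores below are made up.

```python
def count_high_accuracy_classes(best_auroc, threshold=0.9, board_size=8):
    """Return a board_size x board_size grid where each entry counts how many
    classes have a best-feature AUROC above the threshold at that position."""
    grid = [[0] * board_size for _ in range(board_size)]
    for (row, col, _cls), score in best_auroc.items():
        if score > threshold:
            grid[row][col] += 1
    return grid


# Made-up scores for two cells; any (row, col, class) not listed counts as
# "no high-accuracy classifier found".
best_auroc = {
    (3, 3, "mine"): 0.97,
    (3, 3, "theirs"): 0.95,
    (3, 3, "empty"): 0.80,
    (0, 0, "empty"): 0.93,
}
grid = count_high_accuracy_classes(best_auroc)
for row in grid:
    print(" ".join(str(count) for count in row))
```

With three classes per cell, each entry ranges from 0 to 3, which matches reading the plot as “how many classes had a high-accuracy classifier for that position”.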
This is an interesting point. When we did our causality studies across layers, we also found that it is mostly the board-state features in the middle layers, not the deep layers, that are used causally. However, probe accuracy does increase with depth.
I don’t know how this relates to the fact that SAEs also find more of these features in the middle layers. Perhaps the “natural features” that SAEs find in the last few layers do not need to contain much information about the board state, just the partial information needed to make the decision.