Daniel Tan comments on Daniel Tan’s Shortform

Daniel Tan 7 Aug 2024 8:31 UTC
[Repro] Circular Features in GPT-2 Small
This is a paper reproduction in service of my seasonal goals.
Recently, it was demonstrated that circular features are used in the computation of modular addition tasks in language models. I’ve reproduced this for GPT-2 small in this Colab.
We’ve confirmed that days of the week do appear to be represented in a circular fashion in the model. The feature dashboards are also consistent with this finding, which suggests that simply looking up features that detect tokens in the same conceptual ‘category’ could be another way of finding clusters of features with interesting geometry.
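To make the kind of analysis concrete, here is a minimal sketch of reconstructing activations from a handful of SAE features and projecting the result to 2D with PCA. It is not the Colab code: the decoder matrix, the feature indices, and the activations below are all synthetic stand-ins (random directions and hand-built "bump" activations), and the helper names `reconstruct_from_features` and `pca_project` are hypothetical.

```python
import numpy as np

def reconstruct_from_features(acts, w_dec, feature_ids):
    """Partial SAE reconstruction that uses only the chosen features."""
    return acts[:, feature_ids] @ w_dec[feature_ids]

def pca_project(x, k=2):
    """Project the rows of x onto their top-k principal components."""
    centered = x - x.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)
    return centered @ vt[:k].T, explained[:k]

rng = np.random.default_rng(0)
n_features, d_model, n_days = 64, 32, 7

# ASSUMPTION: random matrix standing in for a real SAE decoder.
w_dec = rng.normal(size=(n_features, d_model))
# ASSUMPTION: hypothetical indices; real ones would come from the dashboards.
feature_ids = np.arange(9)

# Toy activations: each selected feature fires for nearby days on a circle.
day_angles = 2 * np.pi * np.arange(n_days) / n_days
feature_angles = 2 * np.pi * np.arange(len(feature_ids)) / len(feature_ids)
acts = np.zeros((n_days, n_features))
acts[:, feature_ids] = np.maximum(
    0.0, np.cos(day_angles[:, None] - feature_angles[None, :])
)

recon = reconstruct_from_features(acts, w_dec, feature_ids)
coords, explained = pca_project(recon, k=2)
print(coords.shape, explained)
```

With real data, `acts` would be the SAE feature activations on the day-of-week tokens and `w_dec` the SAE decoder weights; plotting `coords` is then the step where a circular layout would or would not show up.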
Next steps:
1. Here, we selected 9 SAE features, computed the reconstruction from just those features, and then reduced its dimensionality with PCA. However, were all 9 features necessary? Could we remove some of them without hurting the visualization?
2. The SAE reconstruction using 9 features is probably only a small part of the model’s overall representation of this token. What’s in the rest of the representation? Is it mostly orthogonal to the SAE reconstruction, or is there a sizeable component remaining in this 9-dimensional subspace? If the latter, it would indicate that the SAE representation here is not a ‘full’ representation of the original model.
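Both questions can be phrased as small linear-algebra checks. The sketch below is one possible way to set them up, not something I have run on the real model: it uses synthetic directions and activations, the helper names are hypothetical, and the leave-one-out score is only a crude proxy for "does the visualization change" (it measures the change in the reconstruction, not in the PCA plot).

```python
import numpy as np

def leave_one_out_change(acts, directions):
    """Relative change in the reconstruction when each feature is dropped."""
    full = acts @ directions
    n = directions.shape[0]
    changes = []
    for j in range(n):
        keep = np.arange(n) != j
        partial = acts[:, keep] @ directions[keep]
        changes.append(np.linalg.norm(full - partial) / np.linalg.norm(full))
    return np.array(changes)

def subspace_fraction(residual, directions):
    """Fraction of the residual's squared norm inside span(directions).

    Assumes the rows of `directions` are linearly independent.
    """
    # Columns of q form an orthonormal basis for the span of the directions.
    q, _ = np.linalg.qr(directions.T)
    inside = q @ (q.T @ residual)
    return float(inside @ inside / (residual @ residual))

rng = np.random.default_rng(0)
d_model, n_dirs, n_tokens = 32, 9, 7

# ASSUMPTION: random stand-ins for the 9 decoder directions and activations.
directions = rng.normal(size=(n_dirs, d_model))
acts = rng.random(size=(n_tokens, n_dirs))

# Question 1: which features barely change the reconstruction when removed?
loo = leave_one_out_change(acts, directions)

# Question 2: how much of what the 9 features miss lies in their own subspace?
# ASSUMPTION: a synthetic "model activation" = reconstruction + random extra.
x = acts[0] @ directions + rng.normal(size=d_model)
residual = x - acts[0] @ directions
frac = subspace_fraction(residual, directions)
print(loo, frac)
```

A fraction near 0 would mean the remainder is mostly orthogonal to the 9 features; a large fraction would point to the second possibility in item 2.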
Thanks to Egg Syntax for pair programming and Josh Engels for help with the reproduction.