Zac Hatfield-Dodds comments on Is there a list of projects to get started with Interpretability?

Zac Hatfield-Dodds 7 Sep 2022 7:15 UTC
6 points
I’d love to see some replications of Anthropic’s Induction Heads paper. It’s based around models small enough to train on a single machine (and on a reasonable budget for students!), it relates to cutting-edge interpretability, and it has an explicit “Unexplained Curiosities” section listing weird things to investigate in future work.

For readers not focussed on interpretability, I’d note that ‘scaling laws go down as well as up’: you can do relevant work even on very small models, if you design the experiment well. Two I’d love to see are a replication of “BERTs of a feather do not generalize together”, and some experiments on arithmetic as a small proxy for code models (cf. “Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets”), where you can investigate scaling laws for generalization across fraction of data, number of terms, number of digits, fine-tuning required to add new operators, whether this changes with architecture, etc.
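
To make the arithmetic idea concrete, here is a minimal sketch of a dataset generator that exposes the knobs mentioned above (fraction of data, number of terms, number of digits, set of operators). The function name, its parameters, and the operator table are all hypothetical, and this is not the setup from the Grokking paper; it only illustrates one way such an experiment could be parameterised.

```python
import operator
import random

# Hypothetical operator table; "adding a new operator" would mean extending it.
OPS = {"+": operator.add, "-": operator.sub}


def make_arithmetic_dataset(num_terms=2, num_digits=2, operators=("+",),
                            train_fraction=0.5, num_examples=1000, seed=0):
    """Sample distinct problems and split them into train and test sets."""
    rng = random.Random(seed)
    operators = sorted(set(operators))
    high = 10 ** num_digits - 1  # largest operand with at most num_digits digits
    # Guard against asking for more distinct problems than exist.
    total = (high + 1) ** num_terms * len(operators) ** (num_terms - 1)
    if num_examples > total:
        raise ValueError(f"only {total} distinct problems exist")

    problems = set()
    while len(problems) < num_examples:
        terms = tuple(rng.randint(0, high) for _ in range(num_terms))
        ops = tuple(rng.choice(operators) for _ in range(num_terms - 1))
        problems.add((terms, ops))

    examples = []
    for terms, ops in sorted(problems):  # sort so the split depends only on the seed
        text, answer = str(terms[0]), terms[0]
        for op, term in zip(ops, terms[1:]):
            text += f" {op} {term}"
            answer = OPS[op](answer, term)  # left to right, exact for + and -
        examples.append((text, answer))

    rng.shuffle(examples)
    cut = round(train_fraction * len(examples))
    return examples[:cut], examples[cut:]


# Example: train on 30% of the sampled two-digit addition problems.
train, test = make_arithmetic_dataset(num_digits=2, train_fraction=0.3)
print(len(train), len(test))
```

Sweeping train_fraction, num_terms, and num_digits over a grid, and training a small model on each train split, would give the raw points for the kind of generalization curves described above. The test split only ever contains problems the model has not seen, so accuracy on it measures generalization, not memorization.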

(opinions my own, not speaking for my employer, you know the drill.)