I think there’s tons of low-hanging fruit in toy model interpretability, and I expect at least some lessons from at least some such projects to generalize. A lot of the questions I’m excited about in interpretability are fundamentally accessible in toy models, like “how do models trade off interference and representational capacity?”, “what priors do MLPs have over different hypotheses about the data distribution?”, etc.