On (1), I agree: if you could explain 80% of GPT-4's performance on a task and metric where GPT-3.5 performs half as well as GPT-4, that would suffice for showing something interesting not present in GPT-3.5. For instance, if an explanation could human-interpretably account for 80% of GPT-4's accuracy on solving APPS programming problems, then the explained accuracy would still be higher than GPT-3.5's.
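To make the arithmetic concrete, here's a toy calculation with made-up numbers (the accuracies are purely illustrative, not actual APPS results):

```python
# Hypothetical accuracies: GPT-3.5 performs half as well as GPT-4.
gpt4_acc = 0.60
gpt35_acc = gpt4_acc / 2  # 0.30

# An explanation that recovers 80% of GPT-4's accuracy...
explained_acc = 0.80 * gpt4_acc  # 0.48

# ...still exceeds GPT-3.5's accuracy, so it must capture something
# present in GPT-4 but absent in GPT-3.5.
assert explained_acc > gpt35_acc
print(f"explained: {explained_acc:.2f}, GPT-3.5: {gpt35_acc:.2f}")
```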
However, I expect that performance on these sorts of tasks is pretty sensitive, such that recovering 80% of task performance is much harder than recovering 80% of loss on web text. Most prior results look at explaining loss on webtext (or on a narrow distribution of webtext), not at preserving downstream performance on some task.
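For reference, a minimal sketch of the "fraction of loss recovered" metric as it's typically computed in this line of work; the exact normalization (here, against an ablated baseline) is an assumption on my part and varies between papers:

```python
def fraction_of_loss_recovered(loss_full: float,
                               loss_explained: float,
                               loss_ablated: float) -> float:
    """Fraction of the loss gap closed by an explanation/reconstruction.

    loss_full:      loss of the original, unmodified model
    loss_explained: loss when the component is replaced by its explanation
    loss_ablated:   loss with the component ablated (e.g. zero- or mean-ablated)

    Returns 1.0 if the explanation fully recovers the model's loss,
    and 0.0 if it does no better than ablating the component.
    """
    return (loss_ablated - loss_explained) / (loss_ablated - loss_full)

# Illustrative numbers: original loss 3.0, ablated loss 5.0,
# explanation achieves 3.4 -> recovers ~80% of the gap.
print(fraction_of_loss_recovered(3.0, 3.4, 5.0))
```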
There are some reasons why it could be easier to explain a high fraction of downstream task performance than a high fraction of loss (e.g. if it's a task that humans can do as well as models), but also some annoyances related to only having a small amount of data.
I’m skeptical that (2) will qualitatively matter much, but I can see the intuition.