Lee Sharkey comments on ‘Fundamental’ vs ‘applied’ mechanistic interpretability research

Lee Sharkey 26 May 2023 10:59 UTC
LW: 1 AF: 1
Bilinear layers: not confident at all! They might make a network's structure more amenable to mathematical analysis, so they might help? But as yet, no empirical interpretability wins have come from bilinear layers.
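
For context, here is a minimal sketch of why bilinear layers may be easier to analyse mathematically. It assumes the common bias-free form out = (W1 x) * (W2 x), with an elementwise product; the function names, shapes, and random weights are illustrative assumptions and do not come from the comment. Under that assumption the whole layer is a quadratic form in x, fully described by a single third-order tensor, with no elementwise nonlinearity to reason around.

```python
import numpy as np


def bilinear_layer(x, W1, W2):
    """Bias-free bilinear layer (assumed form): elementwise product of two linear maps."""
    return (W1 @ x) * (W2 @ x)


def bilinear_tensor(W1, W2):
    """Third-order tensor B with B[k, i, j] = W1[k, i] * W2[k, j]."""
    return np.einsum("ki,kj->kij", W1, W2)


# Illustrative sizes and random weights (not taken from any real model).
rng = np.random.default_rng(0)
d_in, d_out = 4, 3
W1 = rng.normal(size=(d_out, d_in))
W2 = rng.normal(size=(d_out, d_in))
x = rng.normal(size=d_in)

# Output computed the usual way.
direct = bilinear_layer(x, W1, W2)

# Same output computed as out[k] = sum_ij B[k, i, j] * x[i] * x[j].
B = bilinear_tensor(W1, W2)
via_tensor = np.einsum("kij,i,j->k", B, x, x)

assert np.allclose(direct, via_tensor)
```

Because the layer is exactly this tensor contraction, standard linear-algebra tools could in principle be applied to B directly. Whether that yields practical interpretability gains is the open question raised above.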

Dictionary learning: this is one of my main bets for comprehensive interpretability.

Other areas: I’m also generally excited by the line of research outlined in https://arxiv.org/abs/2301.04709