ryan_greenblatt comments on How useful is mechanistic interpretability?

ryan_greenblatt 2 Dec 2023 5:54 UTC
2 points
More generally, it seems good to think carefully through questions like “does using X have a principled reason to be better than applying the ‘default’ approach (e.g. training a probe)?” This is worth doing whether or not we actually end up using the default approach, so that we know where the juice is coming from.

In the case of mech-interp-style decompositions, I’m pretty skeptical that there is any juice in finding your classifier by doing something like selecting over components rather than training a probe. But there could theoretically be juice in trying to understand how a probe works by looking at its connections (and the connections of its connections, etc.).
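
To make the comparison concrete, here is a minimal sketch of what checking against the default approach could look like. Everything in it is an assumption for illustration: the activations are synthetic Gaussian data, "components" are stood in for by individual coordinates, the probe is a plain least-squares linear probe, and the names (`make_data`, `probe_acc`, `component_acc`) are hypothetical. The label is deliberately spread across all coordinates, so this toy setup favors the probe by construction; it shows the shape of the comparison, not evidence about real models.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test = 16, 2000, 2000

# Assumption: the label depends on a direction spread evenly over all coordinates.
u = np.ones(d) / np.sqrt(d)


def make_data(n):
    """Synthetic stand-in for model activations and binary labels."""
    X = rng.standard_normal((n, d))
    y = (X @ u > 0).astype(int)
    return X, y


def accuracy(pred, y):
    return float(np.mean(pred == y))


X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

# Default approach: a least-squares linear probe (with bias) on +/-1 targets.
A_tr = np.hstack([X_tr, np.ones((n_train, 1))])
A_te = np.hstack([X_te, np.ones((n_test, 1))])
w, *_ = np.linalg.lstsq(A_tr, 2.0 * y_tr - 1.0, rcond=None)
probe_acc = accuracy((A_te @ w > 0).astype(int), y_te)

# Alternative: select the single coordinate (and sign) that classifies best
# on the training set, then evaluate that one-component classifier on test data.
best_j, best_sign, best_train_acc = 0, 1.0, -1.0
for j in range(d):
    for sign in (1.0, -1.0):
        acc = accuracy((sign * X_tr[:, j] > 0).astype(int), y_tr)
        if acc > best_train_acc:
            best_j, best_sign, best_train_acc = j, sign, acc
component_acc = accuracy((best_sign * X_te[:, best_j] > 0).astype(int), y_te)

print(f"linear probe test accuracy:     {probe_acc:.3f}")
print(f"best single component accuracy: {component_acc:.3f} (coordinate {best_j})")
```

Running both on the same held-out data is the point: if a fancier method only matches the probe, the juice was in the activations rather than in the method.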