The Owain papers you mention seem to me to make three distinct types of moves, in this taxonomy:
finding some useful generalising statement about the I-O map behaviour, potentially conditional on some property of the training data (e.g. the reversal curse)
creating a modified model M’ from M via fine-tuning on some data (but again, not caring about what the data actually does to the internals)
(far less centrally than the above!) speculating about what the internal structure that causes the behavioural patterns above might be (e.g. that maybe models trained on “A=B” learn to map representation(A) --> representation(B) in some MLP, instead of learning the general rule that A and B are the same thing and representing them internally as such)
Found this taxonomy pretty insightful, thanks! Note that “creating a modified model M’ from M” is something that has obvious parallels to mechanistic interpretability (e.g. this is what happens when we do any form of activation patching, steering, etc.). Mechanistic interpretability also often starts from a “useful generalising statement”: in IOI, for instance, there’s a clear way to programmatically infer the output from the input, and other ‘classic’ mech interp circuits start from similarly formulaic data. I think the similarities are pretty striking even if you don’t care to dig further.
I don’t understand how the papers mentioned are about understanding model internals, and as a result I find the term “prosaic interpretability” confusing.
I agree in hindsight that ‘model internals’ is a misnomer. As you say, what we actually really care about is functional understanding (in terms of the input-output map), not representational understanding (in terms of the components of the model), and in future writing I’ll re-frame the goal as such. I still argue that having a good functional understanding falls under the broader umbrella of ‘interpretability’, e.g. training a smaller, more interpretable model that locally predicts the larger model is something that has been historically called ‘interpretability’ / ‘explainability’.
I also like keeping ‘interpretability’ in the name somewhere, if only to make clear that the aspirational goals and theory of change are likely very similar to those of mechanistic interpretability.
So overall, I don’t think the type of work you mention is really focused on internals.
I agree with this! At the same time I think this type of work will be the backbone of the (very under-specified and nascent) agenda of ‘prosaic interpretability’. In my current vision, ‘prosaic interpretability’ will stitch together a swathe of different empirical results into a broader fabric of ‘LLM behavioural science’. This will in turn yield gearsy models like Jan Kulveit’s three-layer model of LLM behavioural science or the Waluigi effect.
Aside: I haven’t been working with Owain very long but it’s already my impression that in order to do the sort of work his group does, it’s necessary to have a collection of (probably somewhat-wrong but still useful) gearsy models like the above that point the way to interesting experiments to do. Regrettably these intuitions are seldom conveyed in formal writing because they are less defensible than the object-level claims. Nonetheless I think they are a very fundamental (and understated) part of the work.
Glad you enjoyed it!