Great post! I’m also a big (though biased) fan of Owain’s research agenda, and share your concerns with mech interp.
I’m therefore coining the term “prosaic interpretability”—an approach to understanding model internals [...]
Concretely, I’ve been really impressed by work like Owain Evans’ research on the Reversal Curse, Two-Hop Curse, and Connecting the Dots[3]. These feel like they’re telling us something real, general, and fundamental about how language models think. Despite being primarily empirical, such work is well-formulated conceptually, and yields gearsy mental models of neural nets, independently of existing paradigms.
[emphasis added]
I don’t understand how the papers mentioned are about understanding model internals, and as a result I find the term “prosaic interpretability” confusing.
Some points that are relevant in my thinking (stealing a diagram from an unpublished draft of mine):
the only thing we fundamentally care about with LLMs is the input-output behaviour (I-O)
now often, a good way to study the I-O map is to first understand the internals M
but if understanding the internals M is hard and you can still make useful generalising statements about the I-O map, then you might as well skip dealing with M at all (cf. psychology, lots of econ, LLM papers like this)
the Owain papers you mention seem to me to make 3 distinct types of moves, in this taxonomy:
finding some useful generalising statement about the I-O map behaviour (potentially conditional on some property of the training data), e.g. the reversal curse (a minimal black-box sketch of such a check follows this list)
creating a modified model M’ from M via fine-tuning on some data (but again, not caring about what the data actually does to the internals)
(far less centrally than the above!) speculating about what the internal structure that causes the behavioural patterns above might be (e.g. that maybe models trained on “A=B” learn to map representation(A) --> representation(B) in some MLP, instead of learning the general rule that A and B are the same thing and representing them internally as such)
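To make the first type of move concrete, here is a minimal black-box sketch of a reversal-curse-style check, treating the model purely as an I-O map f(prompt) → completion. The model (gpt2), the single fact, and the helper function are illustrative stand-ins of my own, not the setup from the actual papers; it assumes the Hugging Face transformers library:

```python
# Minimal black-box sketch: probe the I-O map f(prompt) -> completion directly,
# never touching the internals M. Model and fact are illustrative stand-ins.
from transformers import pipeline

generate = pipeline("text-generation", model="gpt2")

def completes_with(prompt: str, target: str) -> bool:
    """Treat the model as a pure function of its prompt (black box)."""
    out = generate(prompt, max_new_tokens=8, do_sample=False,
                   return_full_text=False)[0]["generated_text"]
    return target.lower() in out.lower()

# (A, B) pair of the kind used in reversal-curse-style evaluations
a, b = "Tom Cruise's mother", "Mary Lee Pfeiffer"

forward = completes_with(f"{a} is", b)    # does "A is ..." yield B?
backward = completes_with(f"{b} is", a)   # does "B is ..." yield A?
print(f"forward: {forward}, backward: {backward}")
```

Everything above is a statement about the I-O map; nothing ever inspects M.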
So overall, I don’t think the type of work you mention is really focused on internals or interpretability at all, except incidentally in minor ways. (There’s perhaps a similar vibe difference here to category theory vs. set theory: the focus being relations between (black-boxed) objects, versus the focus being the internals/contents of objects, with relations and operations defined by what they do to those internals)
I think thinking about internals can be useful—see here for a Neel Nanda tweet arguing the reversal curse is obvious if you understand mech interp—but the black-box research often has a different conceptual frame, and is often powerful specifically when it can skip all theorising about internals while still bringing true generalising statements about models to the table.
And therefore I’d suggest a different name than “prosaic interpretability”. “LLM behavioural science”? “Science of evals”? “Model psychology”? (Though I don’t particularly like any of these terms)
Glad you enjoyed it!

the Owain papers you mention seem to me to make 3 distinct types of moves, in this taxonomy:
finding some useful generalising statement about the I-O map behaviour (potentially conditional on some property of the training data) (e.g. the reversal curse)
creating a modified model M’ from M via fine-tuning on some data (but again, not caring about what the data actually does to the internals)
(far less centrally than the above!) speculating about what the internal structure that causes the behavioural patterns above might be (e.g. that maybe models trained on “A=B” learn to map representation(A) --> representation(B) in some MLP, instead of learning the general rule that A and B are the same thing and representing them internally as such)
Found this taxonomy pretty insightful, thanks! Note that “creating a modified model M’ from M” is something that has obvious parallels to mechanistic interpretability (e.g. this is what happens when we do any form of activation patching, steering, etc.). Mechanistic interpretability also often starts from a “useful generalising statement”, e.g. in IOI there’s a clear way to programmatically infer the output from the input. Other ‘classic’ mech interp circuits start from similarly formulaic data. I think the similarities are pretty striking even if you don’t care to dig further.
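To make that parallel concrete, here is a rough sketch of how mech interp produces its own M’ from M, by intervening on activations at run time rather than fine-tuning. The model, layer index, and steering direction below are arbitrary placeholders, not anything from the papers discussed:

```python
# Hedged sketch of activation steering: attach a forward hook that shifts the
# residual stream, turning the original model M into a modified model M'.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Placeholder steering direction; in practice this would be derived from data
# (e.g. a difference of mean activations between two behaviours).
steer = 0.1 * torch.randn(model.config.n_embd)

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple; element 0 is the residual-stream hidden state.
    return (output[0] + steer,) + output[1:]

# While the hook is attached, generation runs through M', not M.
handle = model.transformer.h[6].register_forward_hook(add_steering)
ids = tok("The quick brown fox", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=10, do_sample=False)
handle.remove()  # detach to recover the original M
print(tok.decode(out[0]))
```

Whether M’ comes from fine-tuning or from a run-time hook, the epistemic move is the same: compare the behaviour of M’ against M.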
I don’t understand how the papers mentioned are about understanding model internals, and as a result I find the term “prosaic interpretability” confusing.
I agree in hindsight that ‘model internals’ is a misnomer. As you say, what we actually care about is functional understanding (in terms of the input-output map), not representational understanding (in terms of the components of the model), and in future writing I’ll reframe the goal as such. I still argue that having a good functional understanding falls under the broader umbrella of ‘interpretability’; e.g. training a smaller, more interpretable model that locally predicts the larger model is something that has historically been called ‘interpretability’ / ‘explainability’.
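As a toy instance of that historical usage, here is a LIME-style sketch: fit a small linear surrogate that locally predicts an opaque model around one input. The opaque_model function below is a hypothetical stand-in for the large model, which we only ever query as a black box:

```python
# Local surrogate sketch: a small, interpretable model fit to mimic an opaque
# model's outputs in a neighbourhood of one input.
import numpy as np
from sklearn.linear_model import Ridge

def opaque_model(X: np.ndarray) -> np.ndarray:
    # Hypothetical stand-in for a large model we only query, never open up.
    return 1.0 / (1.0 + np.exp(-(np.sin(3 * X[:, 0]) + X[:, 1] ** 2)))

x0 = np.array([0.2, -0.5])                     # input whose behaviour we want to explain
X_local = x0 + 0.05 * np.random.randn(500, 2)  # sample perturbations around x0
y_local = opaque_model(X_local)                # query the black box

surrogate = Ridge(alpha=1.0).fit(X_local, y_local)  # small, interpretable local model
print("local linear coefficients:", surrogate.coef_)
```

The surrogate’s coefficients are a functional explanation of the big model’s local behaviour, with no claim about its internals.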
I also like keeping ‘interpretability’ in the name somewhere, if only to make clear that the aspirational goals and theory of change are likely very similar to those of mechanistic interpretability.
So overall, I don’t think the type of work you mention is really focused on internals
I agree with this! At the same time I think this type of work will be the backbone of the (very under-specified and nascent) agenda of ‘prosaic interpretability’. In my current vision, ‘prosaic interpretability’ will stitch together a swathe of different empirical results into a broader fabric of ‘LLM behavioural science’. This will in turn yield gearsy models like Jan Kulveit’s three-layer model of LLM behavioural science or the Waluigi effect.
Aside: I haven’t been working with Owain very long but it’s already my impression that in order to do the sort of work his group does, it’s necessary to have a collection of (probably somewhat-wrong but still useful) gearsy models like the above that point the way to interesting experiments to do. Regrettably these intuitions are seldom conveyed in formal writing because they are less defensible than the object-level claims. Nonetheless I think they are a very fundamental (and understated) part of the work.