A few thoughts on names, since that’s being raised by other comments:
There’s value in a research area having a name, since it helps people new to the field become aware that there’s something there to be looked into.
On the other hand, I worry that this area is too diffuse and too much of a grab bag to crystallize into a single subfield; see eg Rudolf’s point about it being debatable whether work like Owain Evans’s is really well-described as ‘interpretability’.
I think it’s also worth worrying that naming an area potentially distracts from what matters, namely finding the most effective approaches regardless of whether they fit comfortably into a particular subfield; a name arguably creates a streetlight effect that makes some things easier to see and others harder.
One potential downside to ‘prosaic interpretability’ is that it suggests ‘prosaic alignment’, which I’ve seen used to refer to techniques that only aim to handle current AI and will fail to help with future systems; it isn’t clear to me that that’s true of (all of) this area.
‘LLM Psychology’ and ‘Model Psychology’ have been used by Quentin Feuillade-Montixi and @NicholasKees to mean ‘studying LLM systems with a behavioral approach’, which seems like one aspect of this nascent area but by no means the whole. My impression is that ‘LLM psychology’ has unfortunately also been adopted by some of the people making unscientific claims about eg LLMs as spiritual entities on Twitter.
Other names suggested in the comments below so far (consolidating here): ‘LLM behavioural science’, ‘science of evals’, ‘cognitive interpretability’, ‘AI cognitive psychology’
Great list! Going to add ‘empirical interpretability’, which carries the fewest unwanted associations while conveying the essence of what I imagine the area to be.
A few other terms that have been used by researchers to describe this and related areas include ‘high-level interpretability’ (both in Jozdien’s usage and more generally), ‘concept-based interpretability’ (or ‘conceptual interpretability’), and ‘model-agnostic interpretability’.