How about “Cognitive Interpretability”, or “AI Cognitive Psychology” (AI Cog Psych for short) rather than “Prosaic Interpretability”?
“Prosaic” conjures only some of the correct associations, and then only if you’ve heard of “Prosaic Alignment”, which was a pretty bad name imho. If you had told me to guess what you meant by the term PI, I would not have guessed what you have described.
I think MI, and what you call PI, are analogous to Cognitive Neuroscience and Cognitive Psychology, respectively, which is why I think AI Cog Psych will lead to more correct inferences on first hearing.
I also suspect that Cognitive Psychology, especially the linguistics part, already has a wealth of methods that could transfer very nicely onto LLMs. For example, in The Language Instinct, Steven Pinker describes how it is possible to discover many things about how we parse sentences without any brain scans, solely through natural language experiments. He mentions a bunch of other experiments that I think could work quite well, or could at least help build intuitions for how to discover mental mechanisms solely through input-output behaviour on carefully constructed sentences.
It also sounds cooler to say you work on AI Cognitive Psychology rather than Prosaic Interpretability ;)
By the way, the analogy with genes is fantastic. I think it nicely points to the fact that even if some features are relatively straightforward to find, circuits may nevertheless be fiendishly difficult to uncover. Thanks for writing such an excellent post and being honest about some of your hard work that didn’t pan out how you hoped!
That book looks interesting! I’ll check it out.
Atm I dislike ‘cog psych’ because it doesn’t evoke precise meaning for me and likely doesn’t for researchers w/o that background (which I guess is the majority). I do take the point that ‘prosaic’ may be a bad name though.
Glad you enjoyed the post!