A few thoughts on names, since that’s being raised by other comments:
There’s value in a research area having a name, since it helps people new to the field become aware that there’s something there to be looked into.
On the other hand, I worry that this area is too diffuse and too much of a grab bag to crystallize into a single subfield; see eg Rudolf’s point about it being debatable whether work like Owain Evans’s is really well-described as ‘interpretability’.
I think it’s also worth worrying that naming an area potentially distracts from what matters, namely finding the most effective approaches regardless of whether they fit comfortably into a particular subfield; a name arguably creates a streetlight effect that makes some things easier to see and others harder.
One potential downside to ‘prosaic interpretability’ is that it suggests ‘prosaic alignment’, which I’ve seen used to refer to techniques that only aim to handle current AI and will fail to help with future systems; it isn’t clear to me that that’s true of (all of) this area.
‘LLM Psychology’ and ‘Model Psychology’ have been used by Quentin Feuillade-Montixi and @NicholasKees to mean ‘studying LLM systems with a behavioral approach’, which seems like one aspect of this nascent area but by no means the whole. My impression is that ‘LLM psychology’ has unfortunately also been adopted by some of the people making unscientific claims about eg LLMs as spiritual entities on Twitter.
Other names suggested in the comments below so far (consolidating here): ‘LLM behavioural science’, ‘science of evals’, ‘cognitive interpretability’, ‘AI cognitive psychology’
Great list! Going to add ‘empirical interpretability’, which carries the fewest unwanted associations while conveying the essence of what I imagine the area to be.
A few other terms that have been used by researchers to describe this and related areas include ‘high-level interpretability’ (both in Jozdien’s usage and more generally), ‘concept-based interpretability’ (or ‘conceptual interpretability’), and ‘model-agnostic interpretability’.