The frequently accompanying, action-relevant claim—that substantially easier-to-interpret alternatives exist—is probably false and distracts people with fake options. That’s my main thesis.
I agree with this claim (anything inherently interpretable in the conventional sense seems totally doomed). I do want to push back on an implicit vibe of “these models are hard to interpret because of the domain, not because of the structure” though: interpretability is really fucking hard! It’s possible, but these models are weird and cursed and rife with bullshit like superposition, it’s hard to figure out how to break them down into interpretable units, and they’re full of illusions and confused, janky representations, etc.
I don’t really have a coherent and rigorous argument here, I just want to push back on the vibe I got from your “interpretability has many wins” section—it’s really hard!
Yeah, you’re just right about vibes.
I was trying to give “possible but hard” vibes, and the end result just tilts too far one way and doesn’t speak enough about concrete difficulties.