I very much agree that the focus on interpretability is like searching under the light. It’s legible; it’s a way to show that you’ve done something nontrivial—you did some real work on alignment. And it’s generally agreed that it’s progress toward alignment.
When people talk about prosaic alignment proposals, there’s a common pattern: they’ll be outlining some overcomplicated scheme, and then they’ll say “oh, and assume we have great interpretability tools; this whole thing just works way better the better the interpretability tools are”, and then they’ll go back to the overcomplicated scheme. (Credit to Evan for pointing out this pattern to me.)
- Wentworth
But it’s not a way to solve alignment in itself. The notion that we’ll simply understand and track all of the thoughts of a superintelligent AGI is a strange one. I really wonder how seriously people are thinking about the impact model of that work.
And they don’t need to, because it’s pretty obvious that better interp is incremental progress for a lot of AGI scenarios.
This is the same incentive structure that makes progress in academia incredibly slow: there are strong incentives to do legibly impressive work, and surprisingly few incentives to actually make progress on useful theories, because it’s harder to tell what would count as progress.
But if we’re all working on stuff with only small marginal payoffs, who’s working on getting beyond “overcomplicated schemes” and actually creating and working through practical, workable alignment plans?
I really wish some of the folks working on interp would devote a bit more of their time to “solving the whole problem”. It looks to me like we have a dramatic misallocation of resources. We are searching under the light. We need more of us feeling around in the dark where we lost those keys.