That you’re unaware of any notable counterfactual capabilities boost from interpretability is some update. How sure are you that you’d know if there were training multipliers that had interpretability strongly in their causal history? Are you not counting steering vectors from Anthropic here? And I didn’t find out about Hyena from the news article but from a friend who read the paper; the article just had a nicer quote.
I could imagine that interpretability being relatively ML-flavoured makes it more appealing to scaling-lab leadership, and that this, rather than leadership seeing the projects as commercially useful, is why they get favoured, at least in many cases.
Would you expect this to continue as interpretability gets better? I’d be pretty surprised if, for general models, opening black boxes didn’t let you debug them better, though I could imagine we’re just not good enough at it yet.
SAE steering doesn’t seem like it obviously beats other steering techniques in terms of usefulness. I haven’t looked closely into Hyena but my prior is that subquadratic attention papers probably suck unless proven otherwise.
Interpretability is certainly vastly more appealing to lab leadership than weird philosophy, but it’s vastly less appealing than RLHF. And there are many, many ML-flavored directions, only a few of which are any good, so it’s not surprising that most directions don’t get a lot of attention.
Probably as interp gets better it will start to be helpful for capabilities. I’m uncertain whether it will be more or less useful for capabilities than working on capabilities directly: on the one hand, mechanistic understanding has historically underperformed as a research strategy; on the other, that could change once we have a sufficiently good mechanistic understanding.
Are you talking about ML or in general? What are you deriving this from?
For ML, yes. I’m deriving this from the Bitter Lesson.