I don’t think anyone has, to date, used interpretability to make any major counterfactual contribution to capabilities. I would not rely on papers introducing a new technique to be the main piece of evidence as to whether the technique is actually good at all. (I would certainly not rely on news articles about papers—they are basically noise.)
I think you should take into account the fact that before there are really good concrete capabilities results, the process that different labs use to decide what to invest in is highly contingent on a bunch of high variance things. For example, which research directions appeal to research leadership, or whether there happen to be good ICs around who are excited to work on that direction and not tied down to any other project.
I don’t think you should be that surprised by interpretability being more popular than other areas of alignment. Certainly I think incentives towards capabilities are a small fraction of why it’s popular and funded, etc. (if anything, its non-usefulness for capabilities to date may count against it). Rather, I think it’s popular because it’s an area where you can actually get traction and do well-scoped projects and have a tight feedback loop. This is not true of the majority of alignment research directions that actually could help with aligning AGI/ASI, and correspondingly those directions are utterly soul grinding to work on.
One could very reasonably argue that more people should be figuring out how to work on the low traction, ill-scoped, shitty feedback loop research problems, and that the field is looking under the streetlight for the keys. I make this argument a lot. But I think you shouldn’t need to postulate some kind of nefarious capabilities incentive influence to explain it.
I don’t think anyone has, to date, used interpretability to make any major counterfactual contribution to capabilities. I would not rely on papers introducing a new technique to be the main piece of evidence as to whether the technique is actually good at all. (I would certainly not rely on news articles about papers—they are basically noise.)
I would bet against this on the basis that Chris Olah’s work was quite influential on a huge number of people, shaped their mental models of how Deep Learning works in general, and probably contributed to lots of improved capability-oriented thinking and decision-making.
Like, as a kind of related example where I expect it’s easier to find agreement, it’s hard to point to something concrete that “Linear Algebra Done Right” did to improve ML research, but I am quite confident it has had a non-trivial effect. It’s the favorite Linear Algebra textbook of many of the best contributors to the field, and having good models and explanations of the basics makes a big difference.
For the purposes of the original question of whether people are overinvesting in interp due to it being useful for capabilities and therefore being incentivized, I think there’s a pretty important distinction between direct usefulness and this sort of diffuse public good that is very hard to attribute. Things with large but diffuse impact are much more often underincentivized and often mostly done as a labor of love. In general, the more you think an organization is shaped by incentives that are hard to fight against, the more you should expect diffusely impactful things to be relatively neglected.
Separately, it’s also not clear to me that the diffuse intuitions from interpretability have actually helped people a lot with capabilities. Obviously this is very hard to attribute, and I can’t speak too much about details, but it feels to me like the most important intuitions come from elsewhere. What’s an example of an interpretability work that you feel has affected capabilities intuitions a lot?
I think there’s a pretty important distinction between direct usefulness and this sort of diffuse public good that is very hard to attribute. Things with large but diffuse impact are much more often underincentivized and often mostly done as a labor of love. In general, the more you think an organization is shaped by incentives that are hard to fight against, the more you should expect diffusely impactful things to be relatively neglected.
I have a model whereby ~all very successful large companies require a leader with vision, who is able to understand incentives and nonetheless take long-term action that isn’t locally rewarded. YC startups constantly talk about long-term investments into culture and hiring and onboarding processes that are costly in (I’d guess) the 3-12 month time frame but extremely valuable in the 1-5 year time frame.
Saying that a system is heavily shaped by incentives doesn’t seem to me to imply that the system is heavily short-sighted. Companies like Amazon and Facebook are of course heavily shaped by incentives yet have quite long-term thinking in their leaders, who often do things that look like locally wasted effort because they have a vision of how it will pay off years down the line.
Speaking about the local political situation, I think safety investment from AI capabilities companies can be thought of as investing into problems that will come up in the future. As a more cynical hypothesis, I think it can also be usefully thought of as a worthwhile political ploy to attract talent and look responsible to regulators and intelligentsia.
(Added: Bottom-line: Following incentives does not mean short-sighted.)
I think there are a whole bunch of inputs that determine a company’s success. Research direction, management culture, engineering culture, product direction, etc. To be a really successful startup you often just need to have exceptional vision on one or a small number of these inputs, possibly even just once or twice. I’d guess it’s exceedingly rare for a company to have leaders with consistently great vision across all the inputs that go into a company. Everything else will constantly revert towards local incentives. So, even in a company with top 1 percentile leadership vision quality, most things will still be messed up because of incentives most of the time.
I very much agree that the focus on interpretability is like searching under the light. It’s legible; it’s a way to show that you’ve done something nontrivial—you did some real work on alignment. And it’s generally agreed that it’s progress toward alignment.
When people talk about prosaic alignment proposals, there’s a common pattern: they’ll be outlining some overcomplicated scheme, and then they’ll say “oh, and assume we have great interpretability tools, this whole thing just works way better the better the interpretability tools are”, and then they’ll go back to the overcomplicated scheme. (Credit to Evan for pointing out this pattern to me.)
- Wentworth
But it’s not a way to solve alignment in itself. The idea that we’ll just understand and track all of the thoughts of a superintelligent AGI is just a strange idea. I really wonder how seriously people are thinking about the impact model of that work.
And they don’t need to, because it’s pretty obvious that better interp is incremental progress for a lot of AGI scenarios.
This is the incentive that makes progress in academia incredibly slow: there are incentives to do legibly impressive work. There are surprisingly few incentives to actually make progress on useful theories, because it’s harder to tell what would count as progress.
But if we’re all working on stuff with only small marginal payoffs, who’s working on actually getting beyond “overcomplicated schemes” and actually creating and working through practical, workable alignment plans?
I really wish some of the folks working on interp would devote a bit more of their time to “solving the whole problem”. It looks to me like we have a really dramatic misallocation of resources happening. We are searching under the light. We need more of us feeling around in the dark where we lost those keys.
That you’re unaware of there being any notable counterfactual capabilities boost from interpretability is some update. How sure are you that you’d know if there were training multipliers that had interpretability strongly in their causal history? Are you not counting steering vectors from Anthropic here? And I didn’t find out about Hyena from the news article but from a friend who read the paper; the article just had a nicer quote.
I could imagine that interpretability being relatively ML flavoured makes it more appealing to scaling lab leadership, and this is the reason those projects get favoured rather than them seeing it as commercially useful, at least in many cases.
Would you expect that this continues as interpretability continues to get better? I’d be pretty surprised from general models to find that opening black boxes doesn’t let you debug them better, though I could imagine we’re not good enough at it yet.
SAE steering doesn’t seem like it obviously beats other steering techniques in terms of usefulness. I haven’t looked closely into Hyena but my prior is that subquadratic attention papers probably suck unless proven otherwise.
Interpretability is certainly vastly more appealing to lab leadership than weird philosophy, but it’s vastly less appealing than RLHF. But there are many, many ML-flavored directions and only a few of them are any good, so it’s not surprising that most directions don’t get a lot of attention.
Probably as interp gets better it will start to be helpful for capabilities. I’m uncertain whether it will be more or less useful for capabilities than just working on capabilities directly. On the one hand, mechanistic understanding has historically underperformed as a research strategy; on the other hand, this could change once we have a sufficiently good mechanistic understanding.
Are you talking about ML or in general? What are you deriving this from?
For ML, yes. I’m deriving this from the bitter lesson.