I don’t think anyone has, to date, used interpretability to make any major counterfactual contribution to capabilities. I would not rely on papers introducing a new technique to be the main piece of evidence as to whether the technique is actually good at all. (I would certainly not rely on news articles about papers—they are basically noise.)
I would bet against this on the basis that Chris Olah’s work was quite influential on a huge number of people, shaped their mental models of how Deep Learning works in general, and probably contributed to lots of improved capability-oriented thinking and decision-making.
Like, as a kind of related example where I expect it’s easier to find agreement, it’s hard to point to something concrete that “Linear Algebra Done Right” did to improve ML research, but I am quite confident it has had a non-trivial effect. It’s the favorite Linear Algebra textbook of many of the best contributors to the field, and having good models and explanations of the basics makes a big difference.
For the purposes of the original question of whether people are overinvesting in interp due to it being useful for capabilities and therefore being incentivized, I think there’s a pretty important distinction between direct usefulness and this sort of diffuse public good that is very hard to attribute. Things with large but diffuse impact are much more often underincentivized and often mostly done as a labor of love. In general, the more you think an organization is shaped by incentives that are hard to fight against, the more you should expect diffusely impactful things to be relatively neglected.
Separately, it’s also not clear to me that the diffuse intuitions from interpretability have actually helped people a lot with capabilities. Obviously this is very hard to attribute, and I can’t speak too much about details, but it feels to me like the most important intuitions come from elsewhere. What’s an example of an interpretability work that you feel has affected capabilities intuitions a lot?
I have a model whereby ~all very successful large companies require a leader with vision, who is able to understand incentives and nonetheless take long-term action that isn’t locally rewarded. YC startups constantly talk about long-term investments into culture and hiring and onboarding processes that are costly in (I’d guess) the 3-12 month time frame but extremely valuable in the 1-5 year time frame.
Saying that a system is heavily shaped by incentives doesn’t seem to me to imply that the system is heavily short-sighted. Companies like Amazon and Facebook are of course heavily shaped by incentives yet have quite long-term thinking in their leaders, who often do things that look like locally wasted effort because they have a vision of how it will pay off years down the line.
Speaking about the local political situation, I think safety investment from AI capabilities companies can be thought of as investing in problems that will come up in the future. As a more cynical hypothesis, it can also be usefully thought of as a worthwhile political ploy to attract talent and look responsible to regulators and the intelligentsia.
(Added: Bottom line: following incentives does not mean being short-sighted.)
I think there are a whole bunch of inputs that determine a company’s success. Research direction, management culture, engineering culture, product direction, etc. To be a really successful startup you often just need to have exceptional vision on one or a small number of these inputs, possibly even just once or twice. I’d guess it’s exceedingly rare for a company to have leaders with consistently great vision across all the inputs that go into a company. Everything else will constantly revert towards local incentives. So, even in a company with top 1 percentile leadership vision quality, most things will still be messed up because of incentives most of the time.