For the purposes of the original question, of whether people are overinvesting in interp because it is useful for capabilities and therefore incentivized, I think there's a pretty important distinction between direct usefulness and this sort of diffuse public good that is very hard to attribute. Things with large but diffuse impact are much more often under-incentivized, and are often done mostly as a labor of love. In general, the more you think an organization is shaped by incentives that are hard to fight against, the more you should expect diffusely impactful things to be relatively neglected.
Separately, it's also not clear to me that the diffuse intuitions from interpretability have actually helped people much with capabilities. Obviously this is very hard to attribute, and I can't speak too much about details, but it feels to me like the most important intuitions come from elsewhere. What's an example of a piece of interpretability work that you feel has substantially shaped capabilities intuitions?
I have a model whereby ~all very successful large companies require a leader with vision, who is able to understand incentives and nonetheless take long-term action that isn't locally rewarded. YC startups constantly talk about long-term investments in culture and hiring and onboarding processes that are costly on (I'd guess) a 3-12 month timeframe but extremely valuable on a 1-5 year timeframe.
Saying that a system is heavily shaped by incentives doesn’t seem to me to imply that the system is heavily short-sighted. Companies like Amazon and Facebook are of course heavily shaped by incentives yet have quite long-term thinking in their leaders, who often do things that look like locally wasted effort because they have a vision of how it will pay off years down the line.
Speaking about the local political situation, I think safety investment from AI capabilities companies can be thought of as investing in problems that will come up in the future. As a more cynical hypothesis, it can also usefully be thought of as a worthwhile political ploy to attract talent and look responsible to regulators and the intelligentsia.
(Added: the bottom line is that following incentives does not mean being short-sighted.)
I think there are a whole bunch of inputs that determine a company's success: research direction, management culture, engineering culture, product direction, etc. To be a really successful startup you often just need exceptional vision on one or a small number of these inputs, possibly even just once or twice. I'd guess it's exceedingly rare for a company to have leaders with consistently great vision across all the inputs that go into a company. Everything else will constantly revert towards local incentives. So even in a company with 99th-percentile leadership vision, most things will still be messed up because of incentives most of the time.