Why are you considering Anthropic as a unified whole here? Sure, Anthropic as a whole is probably doing some work that is directly aimed towards advancing capabilities, but this just doesn’t seem true of the interp team. (I guess you could imagine that the only reason the interp team exists at Anthropic is that Anthropic believes interp is great for advancing capabilities, but this seems pretty unlikely to me.)
(Note that the criterion is “not specifically work towards advancing capabilities”, as opposed to “try not to advance capabilities”.)
I have found much more success modeling intentions and institutional incentives at the organization level than at the team level.
My guess is the interpretability team is under a lot of pressure to produce insights that would help the rest of the org with capabilities work. In general, I’ve found arguments of the form “this team in this org is working towards totally different goals than the rest of the org” to have a pretty bad track record, unless you are talking about very independent and mostly remote teams.
My guess is the interpretability team is under a lot of pressure to produce insights that would help the rest of the org with capabilities work
I would be somewhat surprised if this were true, assuming you mean a strong form of this claim (i.e. operationalizing “help with capabilities work” as relying predominantly on first-order effects of technical insights, rather than something like “help with capabilities work by making it easier to recruit people”, and “pressure” as something like top-down prioritization of research directions, or setting KPIs that rely on capabilities externalities, etc.).
I think it’s more likely that the interpretability team(s) operate with approximately full autonomy with respect to their research directions, and to the extent that there’s any shaping of outputs, it’s happening mostly at levels like “who are we hiring” and “org culture”.
The pressure here looks more like “I want to produce work that the people around me are excited about, and the kind of thing they are most excited about is stuff that is pretty directly connected to improving capabilities”, where I count “getting AIs to perform a wider range of economically useful tasks” as “improving capabilities”.
I definitely don’t think this is the only pressure the team is under! There are lots of pressures acting on them, and my current guess is that this isn’t the primary one, but I would be surprised if it weren’t quite substantial.
I don’t think the interp team is part of Anthropic just because they might help with a capabilities edge; it seems clear they’d love the agenda to succeed in a way that leaves neural nets no smarter but much better understood. But I’m sure it’s part of the calculus that this kind of fundamental research is also worth supporting because of potential capability edges. (Especially given the importance of stuff like figuring out the right scaling laws in the competition with OpenAI.)
(Fwiw I don’t take issue with this sort of thing, provided the relationship isn’t exploitative. Like if the people doing the interp work have some power/social capital, and reason to expect derived capabilities to be used responsibly.)