Naively, there are so few people working on interp, and so many people working on capabilities, that publishing looks clearly good for relative progress. So you need a pretty strong argument that interp in particular is good for capabilities, which isn’t borne out empirically and also doesn’t seem that strong.
In general, this post feels like it’s listing a bunch of considerations that are pretty small, and the first-order consideration is just “do you want people to know about this interpretability work”, which seems like a relatively straightforward “yes”.
I also separately think that LW tends to reward people for being “capabilities cautious” more than is reasonable, and that once you’ve made the decision to not specifically work towards advancing capabilities, the capabilities externalities of your research probably don’t matter ex ante.
I think current interpretability has close to no capabilities externalities because it is not good yet, and delivers close to no insights into NN internals. If you had a good interpretability tool, which let you read off and understand e.g. how AlphaGo plays games to the extent that you could reimplement the algorithm by hand in C, and not need the NN anymore, then I would expect this to yield large capabilities externalities. This is the level of interpretability I aim for, and the level I think we need to make any serious progress on alignment.
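(As a toy illustration of what “reimplement the algorithm by hand” means, far below the AlphaGo level: suppose interpretability on some tiny trained network revealed that it computes the parity of its input bits. The example and code are mine, purely illustrative.)

```python
# Toy illustration only: once the learned algorithm is understood, it can be
# rewritten as explicit code and the network weights thrown away entirely.
def extracted_parity(bits: list[int]) -> int:
    """Hand-written reimplementation of the (hypothetically) reverse-engineered circuit."""
    result = 0
    for b in bits:
        result ^= b
    return result

assert extracted_parity([1, 0, 1]) == 0
assert extracted_parity([1, 1, 1, 0]) == 1
```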
If your interpretability tools cannot do things even remotely like this, I expect they are quite safe. But then I also don’t think they help much at all with alignment. What I’m saying is that there’s a roughly proportional relationship between your understanding of the network and both your ability to align it and your ability to make it better. I doubt there are many deep insights to be had that further the former without also furthering the latter. Maybe some insights further one a bit more than the other, but I doubt you’d be able to figure out which ones those are in advance. Often, I expect you’d only know years after the insight has been published and the field has figured out all of what can be done with it.
I think it’s all one tech tree, is what I’m saying. I don’t think neural network theory neatly decomposes into a “make strong AGI architecture” branch and an “aim AGI optimisation at a specific target” branch, just like quantum mechanics doesn’t neatly decompose into a “make a nuclear bomb” branch and a “make a nuclear reactor” branch. In fact, in the case of NNs, I expect aiming strong optimisation is probably just straight up harder than creating strong optimisation.
By default, I think if anyone succeeds at solving alignment, they probably figured out most of what goes into making strong AGI along the way. Even just by accident. Because it’s lower in the tech tree.
But isn’t most of the interpretability research being done by people who have not made this commitment? Anthropic, which is currently the biggest publisher of interp-research, clearly does not have a commitment to not work towards advancing capabilities, and it seems important to have thought about which of the things Anthropic works on might substantially increase capabilities (and which things they should hold off on).
I also separately don’t buy that, just because you aren’t specifically aiming to advance capabilities, publishing any of your work is therefore fine. Gwern seems not to be aiming specifically at advancing capabilities, but nevertheless seems to have had a pretty substantial effect on capability work, at least based on talking to a bunch of researchers in DL who cite Gwern as having been influential on them.
Why are you considering Anthropic as a unified whole here? Sure, Anthropic as a whole is probably doing some work that is directly aimed towards advancing capabilities, but this just doesn’t seem true of the interp team. (I guess you could imagine that the only reason the interp team exists at Anthropic is that Anthropic believes interp is great for advancing capabilities, but this seems pretty unlikely to me.)
(Note that the criterion is “not specifically work towards advancing capabilities”, as opposed to “try not to advance capabilities”.)
I have found much more success modeling intentions and institutional incentives at the organization level than the team level.
My guess is the interpretability team is under a lot of pressure to produce insights that would help the rest of the org with capabilities work. In general I’ve found arguments of the type “this team in this org is working towards totally different goals than the rest of the org” to have a pretty bad track record, unless you are talking about very independent and mostly remote teams.
I would be somewhat surprised if this was true, assuming you mean a strong form of this claim (i.e. operationalizing “help with capabilities work” as relying predominantly on 1st-order effects of technical insights, rather than something like “help with capabilities work by making it easier to recruit people”, and “pressure” as something like top-down prioritization of research directions, or setting KPIs which rely on capabilities externalities, etc).
I think it’s more likely that the interpretability team(s) operate with approximately full autonomy with respect to their research directions, and to the extent that there’s any shaping of outputs, it’s happening mostly at levels like “who are we hiring” and “org culture”.
The pressure here looks more like “I want to produce work that the people around me are excited about, and the kind of thing they are most excited about is stuff that is pretty directly connected to improving capabilities”, where I include “getting AIs to perform a wider range of economically useful tasks” under “improving capabilities”.
I definitely don’t think this is the only pressure the team is under! There are lots of pressures that are acting on them, and my current guess is that it’s not the primary pressure, but I would be surprised if it isn’t quite substantial.
I don’t think that the interp team is a part of Anthropic just because they might help with a capabilities edge; seems clear they’d love the agenda to succeed in a way that leaves neural nets no smarter but much better understood. But I’m sure that it’s part of the calculus that this kind of fundamental research is also worth supporting because of potential capability edges. (Especially given the importance of stuff like figuring out the right scaling laws in the competition with OpenAI.)
(Fwiw I don’t take issue with this sort of thing, provided the relationship isn’t exploitative. Like if the people doing the interp work have some power/social capital, and reason to expect derived capabilities to be used responsibly.)
I think it’s probably reasonable to hold off on publishing interpretability work if you strongly suspect that it also advances capabilities. But then that’s just an instance of the general principle “maybe don’t advance capabilities”, and the interpretability part was irrelevant. I don’t really buy that interpretability is so likely to increase capabilities that you should have a sense of general caution around it. If you have a specific sense that e.g. working on nuclear fission could produce a bomb, then maybe you shouldn’t publish (as has historically happened with e.g. research on graphite as a neutron moderator, I think), but generically not publishing physics stuff because “it might be used to build a bomb, vaguely” seems like it basically won’t matter.
I think Gwern is an interesting case, but also idk what Gwern was trying to do. I would also be surprised if Gwern’s effect was “pretty substantial” by my lights (e.g. I don’t think Gwern explained >1%, or probably even 0.1%, of the variance in capabilities, and by the time you’re calling 1000 things “pretty substantial effects on capabilities”, idk what “pretty substantial” means).
This feels a bit weird. Almost no individual explains 0.1% of the variance in capabilities. In general it seems like the effect of norms and guidelines like the ones discussed in the OP could be on the order of a 10% difference in the speed of capability progress, which, depending on your beliefs about p(doom), can translate into a 0.1% to 1% increase or decrease in the chance of literally everyone going extinct. It also seems pretty reasonable for someone to think this kind of stuff doesn’t really matter at all, though I don’t currently think that.
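(A back-of-envelope sketch of how that arithmetic could go; every number below is an illustrative assumption, not something claimed in the thread.)

```python
# Back-of-envelope illustration only; all numbers are assumptions, plugged in
# to show how a ~10% difference in capability speed could cash out as a
# ~0.1% to 1% change in extinction risk depending on your other beliefs.
capability_slowdown = 0.10   # assumed proportional slowdown from the publishing norms
p_doom = 0.20                # assumed baseline probability of doom
# Assumed sensitivity: what fraction of p(doom) a given proportional slowdown removes.
for sensitivity in (0.05, 0.5):
    delta = p_doom * capability_slowdown * sensitivity
    print(f"sensitivity={sensitivity}: change in p(doom) ~ {delta:.3%}")
# -> roughly 0.1% at the low end and 1% at the high end of these assumptions
```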
Hmm, I don’t know why you don’t buy this. If I was trying to make AGI happen as fast as possible I would totally do a good amount of interpretability research and would probably be interested in hiring Chris Olah and other top people in the field. Chris Olah’s interpretability work is one of the most commonly used resources in graduate and undergraduate ML classes, so people clearly think it helps you get better at ML engineering, and honestly, if I was trying to develop AGI as fast as possible I would find it a lot more interesting and promising to engage with than 95%+ of academic ML research.
I also bet that if we were to run a survey on what blogposts and papers top ML people would recommend that others should read to become better ML engineers, you would find a decent number of Chris Olah’s publications in the top 10 and top 100.
I don’t understand why we should have a prior that interpretability research is inherently safer than other types of ML research?
I don’t really want to argue about language. I’ll defend “almost no individual has a pretty substantial effect on capabilities.” I think publishing norms could have a pretty substantial effect on capabilities, and also a pretty substantial effect on interpretability, and currently think the norms suggested have a tradeoff that’s bad-on-net for x-risk.
I think this is false, and that most ML classes are not about making people good at ML engineering. I think Olah’s stuff is disproportionately represented because it’s interesting and is presented well, and also because classes really love being “rigorous” or something in ways that are random. Similarly, proofs of the correctness of backprop are probably common in ML classes, but not that relevant to being a good ML engineer?
I would be surprised if lots of ML engineers thought that Olah’s work was in the top 10 best things to read to become a better ML engineer. I have weaker beliefs about the top 100. I would take even odds (and believe something closer to 4:1 or whatever) that if you surveyed good ML engineers and asked for top-10 lists, not a single Olah interpretability piece would be among the top 10 most mentioned things. I think most of the stuff would be random things about e.g. debugging workflow, how to deal with computers, how to use libraries effectively, etc. If anyone is good at ML engineering and wants to chime in, that would be neat.
Idk, I have the same prior about trying to e.g. prove various facts about ML stuff, or do statistical learning theory type things, or a bunch of other stuff. It’s just like, if you’re not trying to eke out more oomph from SGD, then probably the stuff you’re doing isn’t going to allow you to eke out more oomph from SGD, because it’s kinda hard to do that and people are trying many things.
Yep, makes sense. No need to argue about language. In that case I do think Gwern is a pretty interesting datapoint, and seems worth maybe digging more into.
I would take a bet at 2:1 in my favor for the top 10 thing. Top 10 is a pretty high bar, so I am not at even odds.
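(An aside on the arithmetic of these odds, under the standard convention that accepting a bet at odds a:b in your favor is positive expected value when your probability of winning exceeds b/(a+b); a minimal sketch:)

```python
# Break-even probabilities for the odds mentioned in this exchange.
def break_even_probability(a: int, b: int) -> float:
    # Accepting a bet at odds a:b in your favor is +EV iff P(win) > b / (a + b).
    return b / (a + b)

def odds_to_credence(a: int, b: int) -> float:
    # "Believing a:b" corresponds to assigning probability a / (a + b).
    return a / (a + b)

print(break_even_probability(1, 1))  # even odds: worth taking above 0.50
print(break_even_probability(2, 1))  # 2:1 in your favor: worth taking above ~0.33
print(odds_to_credence(4, 1))        # beliefs "closer to 4:1": ~0.80 credence
```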
Hmm, yeah, I do think I disagree with the generator here, but I don’t feel super confident and this perspective seems at least plausible to me. I don’t believe it with enough probability to make me think that there is negligible net risk, and I feel like I have a relatively easy time coming up with counterexamples from science and other industries (the nuclear scientists working on fission were indeed not working on making weapons, and yet many people ended up working on making weapons).
Not sure how much it’s worth digging more into this here.
Which sorts of works are you referring to on Chris Olah’s blog? I see mostly vision interpretability work (which has not helped with vision capabilities), RNN stuff (which essentially does not help capabilities because of transformers) and one article on back-prop, which is more engineering-adjacent but probably replaceable (I’ve seen pretty similar explanations in at least one publicly available Stanford course).
I’ve seen a lot of the articles here used in various ML syllabi: https://distill.pub/
The basic things studied here transfer pretty well to other architectures. Understanding the hierarchical nature of features transfers from vision to language, and indeed when I hear people talk about how features are structured in LLMs, they often use language borrowed from what we know about how they are structured in vision (i.e. having metaphorical edge-detectors/syntax-detectors that then feed up into higher-level concepts, etc.).
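(To make the edge-detector picture a bit more concrete, here is a minimal sketch, assuming a recent torchvision (>= 0.13, with pretrained weights downloadable) and matplotlib; the choice of ResNet-18 is mine, not something from the thread, but many of its first-layer filters do look like oriented edge and colour detectors:)

```python
# Inspect the first-layer convolution filters of a pretrained vision model.
import matplotlib.pyplot as plt
from torchvision.models import resnet18

model = resnet18(weights="IMAGENET1K_V1")
filters = model.conv1.weight.detach()          # shape: [64, 3, 7, 7]

# Normalise each filter to [0, 1] so it can be shown as an RGB patch.
lo = filters.amin(dim=(1, 2, 3), keepdim=True)
hi = filters.amax(dim=(1, 2, 3), keepdim=True)
filters = (filters - lo) / (hi - lo)

fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for ax, f in zip(axes.flat, filters):
    ax.imshow(f.permute(1, 2, 0).numpy())      # CHW -> HWC for matplotlib
    ax.axis("off")
plt.show()
```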
This statement seems false based on this comment from Chris Olah.
I am not sure what you mean. Anthropic clearly is aiming to make capability advances. The linked comment just says that they aren’t seeking capability advances for the sake of capability advances, but want some benefit like better insight into safety, or better competitive positioning.
Oh I see; I read too quickly. I interpreted your statement as “Anthropic clearly couldn’t care less about shortening timelines,” and I wanted to show that the interpretability team seems to care.
Especially since this post is about capabilities externalities from interpretability research, and your statement introduces Anthropic as “Anthropic, which is currently the biggest publisher of interp-research.” Some readers might draw corollaries like “Anthropic’s interpretability team doesn’t care about advancing capabilities.”
Makes sense, sorry for the confusion.
Just to get some intuitions.
Assume you had a tool that basically allows you to explain the entire network, every circuit and mechanism, etc. The tool spits out explanations that are easy to understand and easy to connect to specific parts of the network, e.g. attention head x is doing y. Would you publish this tool to the entire world or keep it private or semi-private for a while?
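(For concreteness, a purely hypothetical sketch of what such a tool’s output might look like as a data structure; none of the names below come from the thread or any real tool.)

```python
# Purely hypothetical sketch; this does not describe a real tool.
from dataclasses import dataclass

@dataclass
class ComponentExplanation:
    component: str   # e.g. "layer 17, attention head 3"
    mechanism: str   # plain-language account of what that component computes
    evidence: str    # pointer to the ablations / probes backing the claim

def explain_network(model_name: str) -> list[ComponentExplanation]:
    """Stand-in for the imagined tool: faithful explanations of every circuit."""
    raise NotImplementedError("this is the hypothetical under discussion")
```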
I think this case is unclear, but also not central, because I’m imagining the primary benefit of publishing interp research as making interp research go faster, and in this scenario you’ve basically “solved interp”, so that benefit no longer really applies?
Similarly, if you thought that you should publish capabilities research to accelerate progress towards AGI, and you found out how to build AGI, then whether you should publish is not really relevant anymore.
agreed, but also, interpretability is unusually impactful capabilities work