To throw in my two cents, I think it’s clear that whole classes of “mechanistic interpretability” work are about better understanding architectures in ways that, if the research is successful, make it easier to improve their capabilities.
And I think this points strongly against publishing this stuff, especially if the goal is to “make this whole field more prestigious real quick”. Insofar as the prestige is coming from folks who work on AI capabilities, that’s drinking from a poisoned well (since they’ll grant the most prestige to the work that helps them accelerate).
One relevant point I don’t see discussed is that interpretability research is trying to buy us “slack”, but capabilities research consumes available “slack” as fuel until none is left.
What do I mean by this? Sometimes we do some work and are left with more understanding and grounding about what our neural nets are doing. The repeated pattern then seems to be that this helps someone design a better architecture or scale things up, until we’re left with a new more complicated network. Maybe because you helped them figure out a key detail about gradient flow in a deep network, or let them quantize the network better so they can run things faster, or whatnot.
Idk how to point at this thing properly, my examples aren’t great. I think I did a better job talking about this over here on twitter recently, if anyone is interested.
But anyhow I support folks doing their research without broadcasting their ideas to people who are trying to do capabilities work. It would seem nice to me if there were mostly research closure. And I think I broadly see people overestimating the benefits of publishing their work relative to keeping it within a local cluster.
And I think I broadly see people overestimating the benefits of publishing their work relative to keeping it within a local cluster.
I’m surprised by this claim, can you say more? My read is weakly that people in interp underpublish to wider audiences (e.g. getting papers into conferences), though maybe people overpublish blog posts? (Or that I try too hard to make things go viral on Twitter lol)
I disagree with James Payor on people overestimating publishing interpretability work, and I think it’s the opposite: People underestimate how good publishing interpretability work is, primarily because a lot of people on LW view interpretability work as being solved by a single clean insight, when this is usually not the case.
To quote 1a3orn:
One way that people think about the situation, which I think leads them to underestimate the costs of secrecy, is that they think about interpretability as a mostly theoretical research program. If you think of it that way, then I think it disguises the costs of secrecy.
But in addition to being a research program, interpretability is in part about producing useful technical artifacts for steering DL, i.e., standard interpretability tools. And technology becomes good because it is used.
It improves through tinkering, incremental change, and ten thousand slight changes, each of which individually improves some positive quality by 10%. Look at what the first cars looked like and how many transformations they went through to get to where they are now. Or look at the history of the gun. Or, what is relevant for our causes, look at the continuing evolution of open source DL libraries from TF to PyTorch to PyTorch 2. This software became more powerful and more useful because thousands of people have contributed, complained, changed one line of documentation, added boolean flags, completely refactored, and so on and so forth.
If you think of interpretability being “solved” through the creation of one big insight, I think it becomes more likely that interpretability could be closed without tremendous harm. But if you think of it being “solved” through the existence of an excellent torch-shard-interpret package used by everyone who uses PyTorch, together with corresponding libraries for Jax, then I think the costs of secrecy become much more obvious.
Would this increase capabilities? Probably! But I think a world 5 years hence, where capabilities has been moved up 6 months relative to zero interpretability artifacts, but where everyone can look relatively easily into the guts of their models and in fact does so look to improve them, seems preferable to a world 6 months delayed but without these libraries.
I could be wrong about this being the correct framing. And of course, these frames must mix somewhat. But the above article seems to assume the research-insight framing, which I think is not obviously correct.
In general, I think interpretability research is net positive because capabilities will probably differentially progress towards more understandable models, where we are in a huge bottleneck right now for alignment.
I think the issue is that when you get more understandable base components, and someone builds an AGI out of those, you still don’t understand the AGI.
That research is surely helpful though if it’s being used to make better-understood things, rather than enabling folk to make worse-understood more-powerful things.
I think moving in the direction of “insights are shared with groups the researcher trusts” should broadly help with this.
Hm, I should also ask if you’ve seen the results of current work and think it’s evidence that we get more understandable models, more so than we get more capable models?
I’m perhaps misusing “publish” here, to refer to “putting stuff on the internet” and “raising awareness of the work through company Twitter”, etc.
I mostly meant to say that, as I see it, too many things that shouldn’t be published are being published, and the net effect looks plausibly terrible with little upside (though not much has happened yet in either direction).
The transformer circuits work strikes me this way, as do a bunch of others.
Also, I’m grateful to know your read! I’m broadly interested to hear this and other raw viewpoints, to get a sense of how things look to other people.
Interesting, thanks for the context. I buy that this could be bad, but I’m surprised that you see little upside. The obvious upside, esp. for great work like transformer circuits, is getting lots of researchers nerdsniped and producing excellent and alignment-relevant interp work. Which seems huge if it works.
I want to say that I agree the transformer circuits work is great, and that I like it, and am glad I had the opportunity to read it! I still expect it was pretty harmful to publish.
Nerdsniping goes both ways: you also inspire things like the Hyena work trying to improve architectures based on components of what transformers can do.
I think indiscriminate hype and trying to do work that will be broadly attention-grabbing falls on the wrong side, likely doing net harm. Because capabilities improvements seem empirically easier than understanding them, and there’s a lot more attention/people/incentives for capabilities.
I think there are more targeted things that would be better for getting more good work to happen. Like research workshops or unconferences, where you choose who to invite, or building community with more aligned folk who are looking for interesting and alignment-relevant research directions. This would come with way less potential harm imo as a recruitment strategy.
Can you describe how the “local cluster” thing would work outside of keeping it within a single organization? I’d also be very interested in some case studies where people tried this.
I mostly do just mean “keeping it within a single research group” in the absence of better ideas. And I don’t have a better answer, especially not for independent folk or small orgs.
I wonder if we need an arxiv or LessWrong clone where you whitelist who you want to discuss your work with. And some scheme for helping independents find each other, or find existing groups they trust. Maybe with some “I won’t use this for capabilities work without the permission of the authors” legal docs as well.
This isn’t something I can visualize working, but maybe it has components of an answer.