I disagree with James Payor's view that people overestimate the value of publishing interpretability work; I think it's the opposite: people underestimate how valuable publishing interpretability work is, primarily because many people on LW picture interpretability as something solved by a single clean insight, which is usually not the case.
To quote 1a3orn:
One way that people think about the situation, which I think leads them to underestimate the costs of secrecy, is that they think about interpretability as a mostly theoretical research program. If you think of it that way, then I think it disguises the costs of secrecy.
But in addition to being a research program, interpretability is in part about producing useful technical artifacts for steering DL, i.e., standard interpretability tools. And technology becomes good because it is used.
It improves through tinkering, incremental change, and ten thousand slight changes, each of which improves some positive quality by 10%. Look at what the first cars looked like and how many transformations they went through to get to where they are now. Or look at the history of the gun. Or, more relevant to our purposes, look at the continuing evolution of open-source DL libraries from TF to PyTorch to PyTorch 2. This software became more powerful and more useful because thousands of people have contributed, complained, changed one line of documentation, added boolean flags, completely refactored, and so on and so forth.
If you think of interpretability as being "solved" through the creation of one big insight, then I think it becomes more plausible that interpretability could be closed without tremendous harm. But if you think of it as being "solved" through the existence of an excellent torch-shard-interpret package used by everyone who uses PyTorch, together with corresponding libraries for Jax, then I think the costs of secrecy become much more obvious.
Would this increase capabilities? Probably! But I think a world 5 years hence, where capabilities have been moved up 6 months relative to a world with zero interpretability artifacts, but where everyone can look relatively easily into the guts of their models and in fact does so to improve them, seems preferable to a world delayed by 6 months but without these libraries.
I could be wrong about this being the correct framing. And of course, these frames must mix somewhat. But the above article seems to assume the research-insight framing, which I think is not obviously correct.
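To make the "technical artifact" framing concrete: torch-shard-interpret is 1a3orn's hypothetical package name, but even today a few lines of standard PyTorch (the forward-hook API) are enough to look at a model's intermediate activations. The sketch below is purely illustrative and not from the quoted post; record_activations and the toy model are made up for the example.

```python
import torch
import torch.nn as nn


def record_activations(model: nn.Module, inputs: torch.Tensor) -> dict:
    """Run `inputs` through `model`, returning {module_name: activation tensor}."""
    activations = {}
    handles = []
    for name, module in model.named_modules():
        if name == "":  # skip the root module itself
            continue
        handles.append(
            module.register_forward_hook(
                # default arg `name=name` pins the module name at definition time
                lambda mod, inp, out, name=name: activations.__setitem__(name, out.detach())
            )
        )
    try:
        model(inputs)  # hooks fire during the forward pass
    finally:
        for h in handles:
            h.remove()  # always clean up hooks, even if the forward pass errors
    return activations


# Example: peek inside a toy two-layer MLP.
toy_model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
for layer_name, act in record_activations(toy_model, torch.randn(1, 4)).items():
    print(layer_name, tuple(act.shape))
```

This is the kind of small, tinkerable artifact the quote is gesturing at: something that gets better through many incremental contributions rather than one big insight.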
In general, I think interpretability research is net positive because capabilities will probably differentially progress towards more understandable models, and understandability is a huge bottleneck for alignment right now.
I think the issue is that when you get more understandable base components, and someone builds an AGI out of those, you still don’t understand the AGI.
That research is surely helpful, though, if it's being used to make better-understood things rather than enabling folks to make worse-understood, more powerful things.
I think moving in the direction of “insights are shared with groups the researcher trusts” should broadly help with this.
Hm, I should also ask: have you seen the results of current work, and do you think they're evidence that we get more understandable models, more so than more capable models?