One way that people think about the situation, which I think leads them to underestimate the costs of secrecy, is to treat interpretability as a mostly theoretical research program. If you think of it that way, the costs of secrecy are easy to overlook.
But in addition to being a research program, interpretability is in part about producing useful technical artifacts for steering DL, i.e., standard interpretability tools. And technology becomes good because it is used.
It improves through tinkering, incremental change, and ten thousand slight alterations, each of which improves some positive quality by 10%. Look at what the first cars looked like and how many transformations they went through to get to where they are now. Or look at the history of the gun. Or, more relevant for our purposes, look at the continuing evolution of open source DL libraries from TF to PyTorch to PyTorch 2. This software became more powerful and more useful because thousands of people have contributed, complained, changed one line of documentation, added boolean flags, completely refactored, and so on and so forth.
If you think of interpretability as being “solved” through the creation of one big insight, then I think it becomes more likely that interpretability could be closed without tremendous harm. But if you think of it as being “solved” through the existence of an excellent torch-shard-interpret package used by everyone who uses PyTorch, together with corresponding libraries for Jax, then I think the costs of secrecy become much more obvious.
Would this increase capabilities? Probably! But I think a world 5 years hence, where capabilities have been moved up 6 months relative to a world with zero interpretability artifacts, but where everyone can look relatively easily into the guts of their models and in fact does look in order to improve them, seems preferable to a world delayed by 6 months but without these libraries.
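To make concrete what “looking into the guts of their models” could involve, here is a minimal sketch using only plain PyTorch, not any particular interpretability package (the torch-shard-interpret name above is hypothetical): it uses standard forward hooks to capture and summarize intermediate activations, the kind of low-level primitive such a library would presumably wrap in a friendlier interface.

```python
# Minimal sketch: capture intermediate activations with PyTorch forward hooks.
# This is illustrative only; a real interpretability library would add far more.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Store a detached copy of this layer's output for later inspection.
        activations[name] = output.detach()
    return hook

# Attach a hook to every submodule so each forward pass records its internals.
for name, module in model.named_modules():
    if name:  # skip the top-level container itself
        module.register_forward_hook(make_hook(name))

_ = model(torch.randn(8, 16))

# Summarize what the model computed internally on this batch.
for name, act in activations.items():
    print(f"{name}: shape={tuple(act.shape)}, mean={act.mean():.3f}")
```

The point is not this particular snippet but that the underlying mechanism is ordinary, widely used tooling; the value of a shared library comes from thousands of users polishing exactly this kind of workflow.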
I could be wrong about this being the correct framing. And of course, these frames must mix somewhat. But the above article seems to assume the research-insight framing, which I think is not obviously correct.