Thanks for writing the post, and it’s great to see that (at least implicitly) lots of the people doing mechanistic interpretability (MI) are talking to each other somewhat.
Some comments and questions:
I think “science of deep learning” would be a better term than “deep learning theory” for what you’re describing, since none of the phenomena you list are yet theoretically grounded or explained mathematically; they are robust empirical observations. Deep learning theory could be useful, especially if it produced results about the internals of the network, but I think that’s a different genre of work from the science of DL.
In your description of the relevance of the lottery ticket hypothesis (LTH), it feels like a bit of a non-sequitur to immediately discuss removing dangerous circuits at initialisation. I guess the connection is that lottery tickets are in some way about removing circuits at the beginning of training (although we currently only know how to identify those circuits by training to the end first)? I think the LTH potentially has broader relevance for MI. For example, if lottery tickets do exist and perform as well as the full network, they might be easier to interpret (due to increased sparsity); and understanding what the existence of lottery tickets implies could tell us which circuits are more likely to emerge during neural network training.
When you say “Automating Mechanistic Interpretability research”, do you mean automating (1) the task of interpreting a given network (automating MI), or automating (2) the research of building methods/understanding/etc. that enable us to better interpret neural networks (automating MI Research)? I realise that a lot of current MI research, even if the ultimate goal is (2), is currently mostly doing (1) as a first step. Most of the text in that section implies automating (1) to me, but “Eventually, we might also want to automate the process of deciding which interventions to perform on the model to improve AI safety” seems to lean more towards automating (2), which comes under the general approach of automating alignment research. Obviously it would be great to be able to do both, but automating (1) seems both much more tractable and probably necessary to enable scalable interpretability of large models, whereas (2) is potentially less necessary for MI research to be useful for AI safety.
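To make the lottery ticket point above concrete, here is a minimal sketch of iterative magnitude pruning with rewinding to the initial weights. It assumes a toy linear regression in NumPy rather than a real neural network, and the names `train` and `find_ticket` are hypothetical, chosen only for illustration. The thing to notice is that each pruning round has to train to the end before it can decide which weights to drop, even though the resulting mask is then applied at initialisation.

```python
import numpy as np

def train(w_init, mask, X, y, steps=500, lr=0.1):
    """Gradient descent on mean squared error; masked-out weights stay at zero."""
    w = w_init * mask
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = (w - lr * grad) * mask
    return w

def find_ticket(w_init, X, y, rounds=3, prune_frac=0.5):
    """Iterative magnitude pruning, rewinding to w_init after every round."""
    mask = np.ones_like(w_init)
    for _ in range(rounds):
        # The mask can only be chosen after training to completion.
        w_trained = train(w_init, mask, X, y)
        alive = np.flatnonzero(mask)
        n_prune = int(len(alive) * prune_frac)
        # Drop the surviving weights with the smallest trained magnitude.
        drop = alive[np.argsort(np.abs(w_trained[alive]))[:n_prune]]
        mask[drop] = 0.0
    return mask

# Toy data (an assumption for illustration): only the first two of 16 inputs matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
w_true = np.zeros(16)
w_true[:2] = [3.0, -2.0]
y = X @ w_true

w_init = rng.normal(scale=0.1, size=16)
mask = find_ticket(w_init, X, y)        # sparse "ticket" found via full training
w_ticket = train(w_init, mask, X, y)    # retrain the ticket from the same init
```

In this toy setting the surviving mask is much sparser than the original model, which is the property that might make a ticket easier to interpret; whether that carries over to real networks is exactly the open question.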