Hey Spencer,
I realize this is a few months late, and I’ll preface with the caveat that this is a little outside my comfort zone. I generally agree with your post and am interested in practical forward steps towards making this happen.
Apologies for how long this comment turned out to be. I’m excited about people working on this and, based on my own niche interests, have some takes on what could be done and what might be interesting in the future. I don’t think any of this is info-hazardous in theory, but significant success in this work might be best kept on the down low.
Some questions:
Why call this a theory of interpretability as opposed to a theory of neural networks? It seems to me that we want a theory of neural networks. To the extent such a theory succeeds, we would expect various outcomes, like the ability to predict a priori properties or details of neural networks. One such outcome, as with physical systems, might be the ability to take measurements of some properties of a system and map them to other properties (as with the ideal gas equation). If you thought it was possible to develop a more specific theory dealing only with something like interpreting interventions in neural networks (e.g. causal tracing), then this might be better called a theory of interpretability. Just as Mendel understood much of inheritance prior to our understanding of the structure of DNA, there might be insights to be gained without building a theory from the ground up, but my sense is that a ground-up theory is the kind you want.
Have you made any progress on this topic, or do you know anyone who would describe this explicitly as their research agenda? If so, what areas are they working in?
Some points:
One thing I struggle to understand, and might bet against, is the idea that this won’t involve studying toy models. To my mind, Neel’s grokking work, Toy Models of Superposition, and Bilal’s “A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations” all seem to be contributing towards important factors that no comprehensive theory of neural networks could ignore. Specifically, the toy models of superposition work shows concepts like importance, sparsity, and correlation of features being critical to representations in neural networks. Neel’s progress measures work (and I think at least one other Anthropic paper) addresses memorization and generalization. Bilal’s work shows composition of features into actual algorithms. I emphasize all of these because even if you thought an eventual comprehensive mathematical theory would use different terminology or reinterpret those studied networks differently, it seems highly likely that it would use many of those ideas, and essential that it explain similar phenomena. I think other people have already touted heavily the successes of these works, so I’ll stop there and summarize this point as: if I were a mathematician, I’d start with these concepts and build them up to explain more complex systems. I’d focus on what a feature is, how to measure its importance/sparsity efficiently, and then deal with correlation and structure in simpler networks first, and try to find a mathematical theory that started making good predictions in larger systems based on those terms. I definitely wouldn’t start elsewhere.
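To make those quantities concrete, here is a minimal sketch in the spirit of the Toy Models of Superposition setup (my own rough reconstruction, not the authors’ code; the constants and names are arbitrary placeholders): a handful of sparse features with decaying importance get squeezed into fewer hidden dimensions, and you can then read off measurable per-feature quantities like representation strength and interference.

```python
# Rough sketch (my own reconstruction, not the authors' code) of a toy
# superposition setup: n_features sparse features, each with decaying
# importance, compressed into n_hidden < n_features dimensions.
import torch

n_features, n_hidden, batch = 20, 5, 1024
feature_prob = 0.05                                   # each feature is active ~5% of the time
importance = 0.9 ** torch.arange(n_features).float()  # geometrically decaying importance

W = torch.nn.Parameter(0.1 * torch.randn(n_hidden, n_features))
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-3)

for step in range(5000):
    # Sparse inputs: each feature is uniform on [0, 1] with prob feature_prob, else 0.
    x = torch.rand(batch, n_features) * (torch.rand(batch, n_features) < feature_prob).float()
    x_hat = torch.relu(x @ W.T @ W + b)               # compress, then reconstruct
    loss = (importance * (x - x_hat) ** 2).mean()     # importance-weighted reconstruction loss
    opt.zero_grad(); loss.backward(); opt.step()

# Two measurable quantities per feature: how strongly it is represented
# (squared norm of its direction) and how much it interferes with the others.
with torch.no_grad():
    overlaps = W.T @ W                                # (n_features, n_features)
    represented = torch.diag(overlaps)
    interference = (overlaps - torch.diag(represented)).pow(2).sum(dim=1)
    print(represented, interference)
```

The point is just that things like which features get represented become quantities you can plot against importance and sparsity, which is exactly the kind of relationship I’d want a theory to capture.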
In retrospect, maybe I agree that following one’s nose is also very important, but I’d want people to have these toy systems and various insights associated with them in mind. I’d also like to see these theoreticians work with empiricists to find and test ways of measuring quantities which would be the subject of the theorizing.
“And perhaps there really is something about nature that makes it fundamentally amenable to mathematical description in a way that just won’t apply to a large neural network trained by some sort of gradient descent?” I am tempted to take the view that nature is amenable in a very similar way to neural networks, but I believe that nature includes many domains in which we can’t, or at least struggle to, make precise predictions or find precise relations. Specifically, I used to work in molecular systems biology, which I think we understand way better than we used to, but which just isn’t accessible in the way Newtonian physics is. Another example might be ecology, where you might be able to predict migrations or other large effects but not specific details about ecosystems, like specific species population numbers more than 6 months in advance (this is totally unqualified conjecture to illustrate a point). So if neural networks are amenable to a mathematical description, but only to the same extent as molecular systems biology, then we might be in for a hard time. The only recourse I think we have is to consider that we have a few advantages with neural networks that we don’t have in other systems we want to model (see Chris Olah’s blog post here: https://colah.github.io/notes/interp-v-neuro/).
I should add that part of the value I see in my own work is in creating slightly more complicated transformers than those studied in the MI literature: rather than solving algorithmic tasks, they solve spatial tasks based on instructions. These systems are certainly messier and may be a good challenge for a theory of neural network mechanisms/interpretability to make detailed predictions about. I think going straight from toy models to LLMs would be like going from studying viruses/virus replication to ecosystems.
Suggestions for next steps, if you agree with me about the value of existing work, building predictions for small systems first, and trying to expand those predictions to larger neural networks:
Publishing/crowdsourcing/curating a literature review of anything that might be required to start building this theory. (I’m guessing my knowledge of MI-related work is just one small subsection of the relevant work, and if I saw it as my job to create a broader theory I’d be reading way more widely.)
Identifying key goals for such a theory, such as: expanding or identifying quantities which the theory would relate to each other; reliably retropredicting details of previously studied phenomena such as polysemanticity, grokking, memorization, generalization, and which features get represented; and also making novel predictions, such as about how mechanisms compose in neural networks (not sure if anyone’s published on this yet).
The two phenomena/concepts I am most interested in seeing either good empirical or theoretical work on are:
Composition of mechanisms in neural networks (analogous to gene regulation). Something like explaining how Bilal’s work extends to systems that solve more than one task at once, where components can be reused, and making predictions about how that gets structured in different architectures and as a function of different details of training methods.
Predicting the centralization of processing (e.g. the human prefrontal cortex): how, when, and why would such a module arise, and possibly how do we avoid it happening in LLMs if it hasn’t started happening already?
+1 to making a concerted effort to have a mathematical theory
+1 on using the term not-kill-everyoneism
Hey Joseph, thanks for the substantial reply and the questions!
Why call this a theory of interpretability as opposed to a theory of neural networks?
Yeah this is something I am unsure about myself (I wrote: “something that I’m clumsily thinking of as ‘the mathematics of (the interpretability of) deep learning-based AI’”). But I think I was imagining that a ‘theory of neural networks’ would definitely be broader than what I have in mind as being useful for not-kill-everyoneism. I suppose I imagine it including lots of things that are interesting about NNs mathematically or scientifically but which aren’t really contributing to our ability to understand and manage the intelligences that NNs give rise to. So I wanted to try to shift the emphasis away from ‘understanding NNs’ and towards ‘interpreting AI’.
But maybe the distinction is more minor than I was originally worried about; I’m not sure.
Have you made any progress on this topic, or do you know anyone who would describe this explicitly as their research agenda? If so, what areas are they working in?
No, I haven’t really. It was—and maybe still is—a sort of plan B of mine. I don’t know anyone who I would say has this as their research agenda. I think the people doing the closest/next-best things are well known, e.g. the more theoretical parts of Anthropic’s/Neel’s work and, more recently, the interest in singular learning theory from Jesse Hoogland, Alexander GO, Lucius Bushnaq, and maybe others. (Afaict there is a belief that it’s more than just a ‘theory of NNs’ and can actually tell us something about the safety of the AIs.)
One thing I struggle to understand, and might bet against, is the idea that this won’t involve studying toy models. To my mind, Neel’s grokking work, Toy Models of Superposition, and Bilal’s “A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations” all seem to be contributing towards important factors that no comprehensive theory of neural networks could ignore…
I think maybe I didn’t express myself clearly, or the analogy I tried to make didn’t work as intended, because I think maybe we actually agree here(!). I think one reason I made it confusing is that my default position is probably more skeptical about MI than that of a lot of readers. So, with regards to the part where I said: “it is reasonable that the early stages of rigorous development don’t naively ‘look like’ the kinds of things we ultimately want to be talking about. This is very relevant to bear in mind when considering things like the mechanistic interpretability of toy models.” What I was trying to get at is that, to me, proving e.g. some mathematical fact about superposition in a toy model doesn’t look like the kind of ‘interpretability of AI’ that you really ultimately want; it looks too low-level. It’s a ‘toy model’ in the NN sense, but it’s not a toy version of the hard part of the problem. But I was trying to say that you would indeed have to let people like mathematicians actually ask these questions—i.e. ask the questions about e.g. superposition that they would most want to know the answers to, rather than forcing them to only do work that obviously showed some connection to the bigger theme of the actual cognition of intelligent agents or whatever.
Thanks for the suggestions about next steps and for writing about what you’re most interested in seeing. I think your second suggestion in particular is close to the sort of thing I’d be most interested in doing. But I think in practice, a number of factors have held me back from going down this route myself:
The main thing holding me back is probably something like: there just currently aren’t enough people doing it—no critical mass. Obviously there’s that classic game-theoretic element here, in that plausibly lots of people’s minds would be simultaneously changed by there being a critical mass, and so if we all dived in at once it would just work out. But that doesn’t mean I can solve the issue myself. I would want way more people seriously signed up to doing this stuff, including people with more experience than myself (and hopefully the possibility that I would have at least some ‘access’ to those people / the opportunity to learn from them etc.), which seems quite unlikely.
It’s really slow and difficult. I have had the impression talking to some people in the field that they like the sound of this sort of thing but I often feel that they are probably underestimating how slow and incremental it is.
And a related issue is the availability of jobs/job security/funding to seriously pursue it for a while without worrying too much in the short term about getting concrete results out.
Thanks Spencer! I’d love to respond in detail but alas, I lack the time at the moment.
Some quick points:
I’m also really excited about SLT work. I’m curious to what degree there’s value in looking at toy models (such as Neel’s grokking work) and exploring them via SLT, or to what extent reasoning in SLT might be reinvigorated by integrating experimental ideas/methodology from MI (such as progress measures); a rough sketch of the kind of quantity I have in mind is below, after these points. It feels plausible to me that there just haven’t been enough people looking at any of a number of intersections, and this is a good example. Not sure if you’re planning on going to this: https://www.lesswrong.com/posts/HtxLbGvD7htCybLmZ/singularities-against-the-singularity-announcing-workshop-on but it’s probably not in the cards for me. I’m wondering if promoting it to people with MI experience could be good.
I totally get what you’re saying about how a toy model in sense A or B isn’t necessarily a toy version of the hard part of the problem. This explanation helped a lot, thank you!
I hear what you are saying about next steps being challenging for logistical and coordination reasons, and because the problem is just really hard! I guess the recourse we have is something like: look for opportunities/chances that might justify giving something like this more attention or coordination. I’m also wondering if there might be ways of dramatically lowering the bar for doing work in related areas (e.g. the same way Neel writing TransformerLens got a lot more people into MI).
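As promised above, here’s a very rough sketch of the kind of SLT-flavoured quantity I mean: an SGLD-based estimate of a local-learning-coefficient-style measure around a trained checkpoint, the sort of thing one could track alongside MI-style progress measures during training. This is my own illustrative reconstruction, not code from any particular paper or library; the function name, constants, and hyperparameters are placeholders, and the exact scalings vary across the literature.

```python
# Illustrative sketch only: estimate lambda_hat ~= n * beta * (E_SGLD[L(w)] - L(w*))
# by running SGLD localised around a trained checkpoint w*. All names and
# constants are placeholders; exact choices differ across the SLT literature.
import copy
import math
import torch

def estimate_llc(model, loss_fn, batches, n_train, n_steps=500, lr=1e-5, gamma=100.0):
    beta = 1.0 / math.log(n_train)                     # inverse temperature ~ 1/log n
    w_star = [p.detach().clone() for p in model.parameters()]
    sampler = copy.deepcopy(model)                     # walk a copy; keep the checkpoint fixed

    def mean_loss(m):
        return torch.stack([loss_fn(m(x), y) for x, y in batches]).mean()

    with torch.no_grad():
        loss_star = mean_loss(model).item()

    sampled_losses = []
    for _ in range(n_steps):
        loss = mean_loss(sampler)
        grads = torch.autograd.grad(loss, list(sampler.parameters()))
        with torch.no_grad():
            for p, g, w0 in zip(sampler.parameters(), grads, w_star):
                drift = n_train * beta * g + gamma * (p - w0)       # localised SGLD drift
                p.add_(-0.5 * lr * drift + math.sqrt(lr) * torch.randn_like(p))
        sampled_losses.append(loss.item())

    return n_train * beta * (sum(sampled_losses) / len(sampled_losses) - loss_star)
```

The fun empirical question, to my mind, is whether an estimate like this moves in lockstep with the progress measures from the grokking work on the same toy models.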
Looking forward to more discussions on this in the future, all the best!