Hey Joseph, thanks for the substantial reply and the questions!
Why call this a theory of interpretability as opposed to a theory of neural networks?
Yeah, this is something I am unsure about myself (I wrote: “something that I’m clumsily thinking of as ‘the mathematics of (the interpretability of) deep learning-based AI’”). But I think I was imagining that a ‘theory of neural networks’ would definitely be broader than what I have in mind as being useful for not-kill-everyoneism. I suppose I imagine it including lots of things that are interesting about NNs mathematically or scientifically but which aren’t really contributing to our ability to understand and manage the intelligences that NNs give rise to. So I wanted to try to shift the emphasis away from ‘understanding NNs’ and towards ‘interpreting AI’.
But maybe the distinction is more minor than I was originally worried about; I’m not sure.
have you made any progress on this topic or do you know anyone who would describe this explicitly as their research agenda? If so what areas are they working in.
No, I haven’t really. It was, and maybe still is, a sort of plan B of mine. I don’t know anyone who I would say has this as their research agenda. I think the closest/next-best things are well known, e.g. the more theoretical parts of Anthropic’s/Neel’s work and, more recently, the interest in singular learning theory from Jesse Hoogland, Alexander GO, Lucius Bushnaq, and maybe others. (AFAICT there is a belief that singular learning theory is more than just a ‘theory of NNs’ and can actually tell us something about the safety of AIs.)
One thing I struggle to understand, and might bet against, is that this won’t involve studying toy models. To my mind, Neel’s grokking work, Toy Models of Superposition, and Bilal’s “A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations” all seem to be contributing towards important factors that no comprehensive theory of neural networks could ignore…
I think maybe I didn’t express myself clearly, or the analogy I tried to make didn’t work as intended, because I think we may actually agree here(!). One reason I made it confusing is that my default position is probably more skeptical about MI than that of a lot of readers. So, with regards to the part where I said: “it is reasonable that the early stages of rigorous development don’t naively ‘look like’ the kinds of things we ultimately want to be talking about. This is very relevant to bear in mind when considering things like the mechanistic interpretability of toy models.” What I was trying to get at is that, to me, proving e.g. some mathematical fact about superposition in a toy model doesn’t look like the kind of ‘interpretability of AI’ that you really ultimately want; it looks too low-level. It’s a ‘toy model’ in the NN sense, but it’s not a toy version of the hard part of the problem. But I was trying to say that you would indeed have to let people like mathematicians actually ask these questions, i.e. ask the questions about e.g. superposition that they would most want to know the answers to, rather than forcing them to only do work that obviously shows some connection to the bigger theme of the actual cognition of intelligent agents or whatever.
Thanks for the suggestions about next steps and for writing about what you’re most interested in seeing. I think your second suggestion in particular is close to the sort of thing I’d be most interested in doing. But I think in practice, a number of factors have held me back from going down this route myself:
The main thing holding me back is probably something like: there just aren’t currently enough people doing it; no critical mass. Obviously there’s a classic game-theoretic element here, in that plausibly lots of people’s minds would be changed simultaneously by there being a critical mass, so if we all dived in at once, it would just work out. But that doesn’t mean I can solve the issue myself. I would want way more people seriously signed up to doing this stuff, including people with more experience than myself (and hopefully the possibility that I would have at least some ‘access’ to those people/opportunity to learn from them, etc.), which seems quite unlikely.
It’s really slow and difficult. Talking to some people in the field, I’ve had the impression that they like the sound of this sort of thing, but I often feel they are probably underestimating how slow and incremental it is.
And a related issue is the existence of jobs/job security/funding to seriously pursue it for a while without worrying too much in the short term about getting concrete results out.
Thanks Spencer! I’d love to respond in detail but alas, I lack the time at the moment.
Some quick points:
I’m also really excited about SLT work. I’m curious to what degree there’s value in taking toy models (such as those in Neel’s grokking work) and exploring them via SLT, or to what extent reasoning in SLT might be reinvigorated by integrating experimental ideas/methodology from MI (such as progress measures). It feels plausible to me that there just haven’t been enough people looking at stuff in any of a number of intersections, and this is a good example. Not sure if you’re planning on going to this: https://www.lesswrong.com/posts/HtxLbGvD7htCybLmZ/singularities-against-the-singularity-announcing-workshop-on but it’s probably not in the cards for me. I’m wondering if promoting it to people with MI experience could be good.
I totally get what you’re saying: a model being a toy model in sense A or B doesn’t necessarily make it a toy version of the hard part of the problem. This explanation helped a lot, thank you!
I hear what you’re saying about next steps being challenging for logistical and coordination reasons, and because the problem is just really hard! I guess the recourse we have is something like: look for opportunities that might justify giving something like this more attention or coordination. I’m also wondering if there might be ways of dramatically lowering the bar for doing work in related areas (e.g. the same way Neel writing TransformerLens got a lot more people into MI).
Looking forward to more discussions on this in the future, all the best!