Two shovel-ready theory projects in interpretability.
Most scientific work, especially theoretical research, isn’t “shovel-ready.” It’s difficult to generate well-defined, self-contained projects where the path forward is clear without extensive background context. In my experience, this is extra true of theory work, where most of the labour is often about figuring out what the project should actually be, because the requirements are unclear or confused.
Nevertheless, I currently have two theory projects related to computation in superposition in my backlog that I think are valuable and that maybe have reasonably clear execution paths. Someone just needs to crunch a bunch of math and write up the results.
Impact story sketch: We now have some very basic theory for how computation in superposition could work[1]. But I think there’s more to do there that could help our understanding. If superposition happens in real models, better theoretical grounding could help us understand what we’re seeing in these models, and how to un-superpose them back into sensible individual circuits and mechanisms we can analyse one at a time. With sufficient understanding, we might even gain some insight into how circuits develop during training.
This post has a framework for compressing lots of small residual MLPs into one big residual MLP. Both projects are about improving this framework.
1) I think the framework can probably be pretty straightforwardly extended to transformers. This would help make the theory more directly applicable to language models. The key thing to show there is how to do superposition in attention. I suspect you can more or less use the same construction the post uses, with individual attention heads now playing the role of neurons. I put maybe two work days into trying this before giving it up in favour of other projects. I didn’t run into any notable barriers; the calculations just proved to be more extensive than I’d hoped they’d be.
2) Improve error terms for circuits in superposition at finite width. The construction in this post is not optimised to be efficient at finite network width. Maybe the lowest-hanging fruit for improving it is changing the hyperparameter p, the probability with which we connect a circuit to a set of neurons in the big network. We set p = log(M)·m/M in the post, where M is the MLP width of the big network and m is the minimum neuron count per layer the circuit would need without superposition. The log(M) choice here was pretty arbitrary. We just picked it because it made the proof easier. Recently, Apollo played around a bit with superposing very basic one-feature circuits into a real network, and IIRC a range of p values seemed to work ok. Getting tighter bounds on the error terms as a function of p that are useful at finite width would be helpful here. Then we could better predict how many circuits networks can superpose in real life as a function of their parameter count. If I were tackling this project, I might start by just trying really hard to get a better error formula directly for a while. Just crunch the combinatorics. If that fails, I’d maybe switch to playing more with various choices of p in small toy networks to develop intuition. Maybe plot some scaling laws of performance with p at various network widths in 1-3 very simple settings, then try to guess a formula from those curves and try to prove it’s correct.
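For the toy-network route, here is a minimal numpy sketch of the kind of p-scan I have in mind. To be clear, this is not the construction from the circuits-in-superposition post: each “circuit” here just computes ReLU(x − 0.5) on a sparse input, gets wired to a random subset of neurons with random ±1 input weights, and reads its output back as the average activation of its +1-sign neurons. The readout scheme, sparsity level, T = 4M, the 0.5 threshold, and the helper name `run_trial` are all illustrative assumptions, only meant to show the shape of an experiment that scans p at a few widths.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_trial(M, T, p, n_samples=2000, n_active=4):
    """Superpose T copies of the toy circuit y = ReLU(x - 0.5) into one
    M-neuron ReLU layer and return the mean squared readout error."""
    # Random wiring: circuit t uses neuron j with probability p, with a random sign.
    mask = rng.random((T, M)) < p
    signs = rng.choice([-1.0, 1.0], size=(T, M))
    W_in = mask * signs                       # (T, M) input weights

    # Sparse inputs: n_active circuits get an input in (0, 1) per sample.
    X = np.zeros((n_samples, T))
    for i in range(n_samples):
        active = rng.choice(T, size=n_active, replace=False)
        X[i, active] = rng.random(n_active)

    pre = X @ W_in - 0.5                      # shared bias of -0.5 on every neuron
    acts = np.maximum(pre, 0.0)               # shared ReLU layer

    # Readout for circuit t: mean post-ReLU activation over its +1-sign neurons.
    pos = mask & (signs > 0)
    counts = np.maximum(pos.sum(axis=1), 1)   # avoid dividing by zero
    Y_hat = acts @ pos.T / counts             # (n_samples, T)

    Y_true = np.maximum(X - 0.5, 0.0)
    scored = X > 0                            # only score circuits that were active
    return float(np.mean((Y_hat[scored] - Y_true[scored]) ** 2))

# Scan p for a few widths M; here each circuit needs one neuron, i.e. m = 1
# in the post's notation, and we superpose T = 4M circuits into M neurons.
for M in (64, 256, 1024):
    T = 4 * M
    for p in np.geomspace(2.0 / M, 0.2, 6):
        print(f"M={M:5d}  p={p:.4f}  mse={run_trial(M, T, p):.5f}")
```

Curves like these, plotted against p for several M, are the kind of thing you could try to fit a formula to before attempting a proof.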
Another very valuable project is of course to try training models to do computation in superposition instead of hard coding it. But Stefan mentioned that one already.
[1]
1. Boolean computations in superposition LW post.
2. Boolean computations paper version of the LW post, with more worked out but some of the fun stuff removed.
3. Some proofs about information-theoretic limits of comp-sup.
4. General circuits in superposition LW post.
If I missed something, a link would be appreciated.