Strong upvoted. I think the idea in this post could (if interpreted very generously) turn out to be pretty important for making progress on the more ambitious forms of interpretability. If we/the AIs can pin down more details about what constitutes a valid learning story or a learnable curriculum, and tie that to the way gradient updates can be decomposed into signal on some circuit and noise on the rest of the network, then it seems like we should be able to understand each circuit as the endpoint of a training story, with each step of the training story corresponding to a simple modification of the circuit that adds a bit more complexity. This is potentially better for interpretability than if it were easy for networks to learn huge chunks of structure all at once. How optimistic are you that there are general insights to be had about the structure of learnable curricula and their relation to networks’ internal structure?
Thanks! I definitely believe this, and I think we have a lot of evidence for it in both toy models and LLMs (I’m planning a couple of posts on this idea of “training stories”), as well as theoretical reasons in some contexts. I’m not sure how easy it is to extend the specific approach used in the parity proof to a general context. It inherently uses the orthogonality of Fourier functions on Boolean inputs, and understanding other ML algorithms in terms of nice orthogonal functions seems hard to do rigorously, unless you either make some kind of simplifying “presumption of independence” assumption about learnable algorithms or work in a toy context. In the toy case, there is a nice paper that does exactly this (it explains how NNs will tend to find “incrementally learnable” algorithms) by using an idea similar to the parity proof I outlined: the leap complexity paper, which Kaarel and I have looked into (and I think you’ve also looked into related things).
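(For concreteness, the orthogonality fact I have in mind is just the standard Fourier–Walsh basis on the Boolean cube, sketched here in the usual conventions rather than the proof’s exact notation: for $x$ uniform on $\{-1,1\}^n$ and $S \subseteq [n]$, let $\chi_S(x) = \prod_{i \in S} x_i$. Then
$$\mathbb{E}_x\!\left[\chi_S(x)\,\chi_T(x)\right] = \begin{cases} 1 & S = T \\ 0 & S \neq T, \end{cases}$$
so any $f\colon \{-1,1\}^n \to \mathbb{R}$ decomposes uniquely as $f(x) = \sum_S \hat{f}(S)\,\chi_S(x)$ with $\hat{f}(S) = \mathbb{E}_x[f(x)\,\chi_S(x)]$, and the parity on a set $S$ is exactly the single basis function $\chi_S$. It’s this clean orthogonal decomposition that the parity argument leans on and that seems hard to reproduce rigorously for general ML algorithms.)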