The notion of a precision scale for interpretability is really interesting, particularly the connection with generalization/memorization. This seems like a fruitful concept to develop further.
How, then, can we operationalize the loss scale of a phenomenon? Well, one way to do this is to imagine that we have some “natural” complexity parameter c that can be varied (this could be a parameter controlling model size, training length, etc.).
It could be interesting to think about the interpretation of different possible complexity parameters here. You might expect these to give rise to distinct but related notions of generalization. Here, I’m drawing on intuitions from my work connecting scaling laws to data distributions, though I hadn’t put it in exactly these terms before. (I’ll write a LW post summarizing/advertising this paper, but haven’t gotten to it yet...)
One interesting scaling regime is scaling in effective model size (given infinite training data & compute). You can also think about scaling in training data size (given infinite model capacity & compute). I think model scaling is basically akin to what you’re talking about here. Data scaling could be useful to study too, as it gets around the need to understand & measure effective model capacity. Of course, these are theoretical limits; in practice one usually scales model size & data together in a compute-optimal way, probably isn’t training to convergence, etc.
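To make the two limits concrete, a useful reference point is the standard parametric loss fit from the Chinchilla scaling-law paper (Hoffmann et al., 2022); this is just the usual form, not something taken from the post:

$$ L(N, D) \;\approx\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}} $$

Here N is parameter count and D is data size: the model-scaling limit corresponds to D → ∞, where only the A/N^α term (plus the irreducible E) remains; the data-scaling limit corresponds to N → ∞, where only B/D^β remains; and compute-optimal training is the constrained path where neither term dominates.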
If the data distribution consists of clusters of varying size (alternatively, subtasks of varying importance), then taking model size as the complexity parameter could give a notion of generalization as modeling the most important parts of the data distribution. Memorization then consists of modeling rarely observed or unimportant components. On the other hand, taking data size as the complexity parameter would suggest that generalization consists of coarsely modeling the entire data distribution, with it being memorization-like to model fine details and exceptions anywhere.
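As a toy illustration of that contrast (made-up Zipfian cluster weights and arbitrary thresholds, just to sketch the intuition rather than anything measured):

```python
import numpy as np

# Toy data distribution: K subtasks/clusters with Zipfian importance weights.
K = 1000
weights = 1.0 / np.arange(1, K + 1) ** 1.2   # hypothetical Zipf exponent
p = weights / weights.sum()

# Capacity-limited regime (infinite data): a budget of `capacity` units goes to
# the most important clusters, which get modeled in full detail; the tail gets nothing.
capacity = 50
head = np.argsort(p)[::-1][:capacity]
print(f"capacity-limited: full detail on {capacity} clusters "
      f"covering {p[head].sum():.1%} of the mass, nothing on the rest")

# Data-limited regime (infinite capacity): with N samples, cluster i receives about
# N * p[i] examples. Coarse structure needs only a few examples per cluster;
# fine details and exceptions need many.
N = 10_000
counts = N * p
coarse_ok = counts >= 5      # enough samples to model the cluster coarsely
fine_ok = counts >= 500      # enough samples to capture its fine details
print(f"data-limited: coarse coverage of {p[coarse_ok].sum():.1%} of the mass, "
      f"but fine details for only {p[fine_ok].sum():.1%}")
```

So the capacity-limited regime drops whole (unimportant) components, while the data-limited regime touches almost everything but resolves fine structure almost nowhere; those are the two different flavors of “memorization” above.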
It also would be interesting to think about other complexity scaling parameters, for example, test-time compute in an AlphaZero-style setting.
If possible, we would like models in this class to be “locally simultaneously interpretable”, i.e. that for two nearby values c ≈ c′, the models M_c and M_c′ have similar weights and implement similar circuits.
My impression of what one is supposed to expect from this is that as the complexity parameter increases, the learned circuits quantitatively improve but never undergo a radical qualitative shift at any particular scale. Would you agree with that? So for a “good circuit”, the explained loss is basically monotonic: stable around ~100% (or slowly decreasing) as one goes from complexity 0 up to the cutoff complexity, and dropping off beyond that. But if there were a qualitative step change, you would instead see a peak in the explained loss around the cutoff complexity, rising as the complexity approaches the cutoff from below and falling off above it. In that situation, the loss precision scale would seem less natural as a measure of circuit understanding.
Basically, the concern would be something like a model implementing an algorithm with low algorithmic complexity but a large constant factor, so it can only emerge and become dominant at some critical model scale. (Similar to but not exactly the same as grokking.) One realistic possible instance of this might be the emergence of in-context learning in LLMs only at large enough scales.
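To make the shape-based test above concrete, here's a cartoon version with made-up explained-loss curves (nothing measured; just the two shapes I have in mind):

```python
import numpy as np

def rise_and_fall(explained):
    """Crude diagnostic for the two shapes: how much the explained-loss curve climbs
    before its maximum and how much it drops afterwards, as the complexity parameter
    increases. A 'good circuit' (flat near 100%, then decaying) has ~zero climb; a
    circuit that only emerges at a critical scale climbs substantially before the peak."""
    explained = np.asarray(explained, dtype=float)
    i = int(np.argmax(explained))
    rise = explained[: i + 1].max() - explained[: i + 1].min()   # climb up to the peak
    fall = explained[i] - explained[i:].min()                    # drop after the peak
    return rise, fall

cs = np.linspace(0.1, 10.0, 50)                  # hypothetical complexity sweep
good = np.clip(1.05 - 0.08 * cs, 0.0, 1.0)       # stable ~100%, then decays past the cutoff
emergent = np.exp(-((cs - 4.0) ** 2) / 2.0)      # only explains loss near a critical scale

for name, curve in [("good circuit", good), ("emergent circuit", emergent)]:
    rise, fall = rise_and_fall(curve)
    print(f"{name}: climb to peak {rise:.2f}, drop after peak {fall:.2f}")
```

A big climb before the peak would be the warning sign that the circuit only switches on around the cutoff, rather than being a refinement of what simpler models already do.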
Yeah, I agree that it would be even more interesting to look at various complexity parameters. The inspiration here is of course physics: isolating a particle/effective particle (like a neutron in a nucleus), or an interaction between a fixed set of particles, by putting it in a regime where other interactions and groupings drop out. The go-to for a physicist is temperature: you can isolate a neutron by putting the nucleus in a very high-temperature environment like a collider, where the constituent baryons separate. This (as well as the behavior w.r.t. generality) is the main reason I suggested “natural degradation” from SLT, since it samples from the tempered distribution and is the most direct analog of varying temperature (putting stuff in a collider). But you can vary other hyperparameters as well. Probably an even more interesting thing to do is to simultaneously vary two things with “opposite” behaviors, which I think is what you’re suggesting above. A cartoon notion of the memorization-generalization “scale” is that if you have low complexity coming from low parameter count/depth or low training time (the latter often behaves similarly to low data diversity), you get simpler, “more memorization-y” circuits; I’m planning to talk more about this in a later “learning stories” series, but from work on grokking, leap complexity, etc., people expect later solutions to generalize better. So if you combine this with the tempering “natural degradation” above, you might be able to get rid of behaviors both above and below a range of interest.
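For concreteness, by the tempered distribution I just mean the usual tempered Bayes posterior from SLT (writing it out in my own notation in case that's useful):

$$ p_{\beta}(w) \;\propto\; \varphi(w)\, e^{-n\beta L_n(w)} $$

where L_n is the empirical loss, φ is a prior over weights, and β = 1/T is the inverse temperature; raising the temperature flattens the posterior so that only the coarsest, most loss-relevant structure survives, which is the sense in which it plays the role of the collider.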
You’re right that tempering is not a binary on/off switch. Because of the nature of tempering, you do expect exponential decay of “inefficient” circuits as your temperature gets higher than the circuit’s “characteristic temperature” (this is analogous to how localized particles tend to have exponentially less coupling as they get separated), so it’s not completely unreasonable to “fully turn off” a class of behaviors. But something special in physics that probably doesn’t happen in AI is that the temperature scales relevant for different forces are separated by many orders of magnitude, so scales separate very clearly. In AI, I agree that, as you described, tempering will only “partially” turn off many of the behaviors you want to clean up. It’s plausible that for simple circuits there is enough separation of characteristic temperature between the circuit and its interactions with other circuits that something approaching the physics behavior is possible, but for most phenomena I’d guess that your “things decay more messily” picture is more likely.
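To spell out what I mean by the exponential decay (as I understand the SLT picture, glossing over constants and log-log corrections): comparing the tempered posterior mass with and without an inefficient circuit that buys a loss reduction ΔL at an extra complexity cost Δλ (local learning coefficient), the relative weight scales roughly like

$$ \frac{Z_{\text{with circuit}}}{Z_{\text{without}}} \;\approx\; \exp\!\big(n\beta\,\Delta L \;-\; \Delta\lambda \,\log(n\beta)\big) $$

so the characteristic (inverse) temperature is roughly where the two terms balance, and once β drops below it the circuit’s posterior weight falls off very quickly.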