Great question, thanks. tl;dr: it depends what you mean by established; the obstacle to establishing such a thing is probably lower than you think.
To clarify the two types of phase transitions involved here, in the terminology of Chen et al:
Bayesian phase transition in number of samples: as discussed in the post you link to in Liam’s sequence, this is where the concentration of the Bayesian posterior shifts suddenly from one region of parameter space to another as the number of samples increases past some critical sample size n. There are also Bayesian phase transitions with respect to hyperparameters (such as variations in the true distribution), but those are not what we’re talking about here.
Dynamical phase transitions: the “backwards S-shaped loss curve”. I don’t believe there is an agreed-upon formal definition of this kind of phase transition in the deep learning literature, but what we mean by it is that the SGD trajectory is for some time strongly influenced by (e.g. in the neighbourhood of) a critical point w∗α and then strongly influenced by another critical point w∗β. In the clearest case there are two plateaus, the one with higher loss corresponding to the label α and the one with the lower loss corresponding to β. In larger systems there may not be a clear plateau (e.g. in the case of induction heads that you mention) but it may still be reasonable to think of the trajectory as dominated by the critical points.
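As a concrete illustration of this plateau structure (my own toy sketch, not from Chen et al: a two-layer linear model with decoupled modes, with made-up target values), gradient descent from a small initialisation learns the modes sequentially, lingering near one critical point before moving to the next:

```python
import numpy as np

# Toy two-layer linear model: loss = 0.5 * sum_i (a_i * b_i - s_i)^2.
# With small initialisation the modes are learned sequentially (larger
# s_i first), giving a stepwise loss curve whose plateaus correspond to
# the trajectory lingering near critical points. Values are illustrative.
s = np.array([4.0, 1.0])        # target "singular values" (hypothetical)
a = np.full(2, 1e-3)            # small init: start near the saddle at 0
b = np.full(2, 1e-3)
lr = 0.01
losses = []
for _ in range(5000):
    r = a * b - s                           # residual per mode
    losses.append(0.5 * (r ** 2).sum())
    a, b = a - lr * r * b, b - lr * r * a   # simultaneous gradient step

# Around step 300 the fast (s=4) mode has converged while the slow mode
# is still near its saddle, so the loss plateaus near 0.5 * 1^2 = 0.5
# before eventually dropping to ~0.
```

Plotting `losses` gives the stepwise / backwards-S curve: each plateau is a stretch of the trajectory dominated by one critical point.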
The former kind of phase transition is a first-order phase transition in the sense of statistical physics, once you relate the posterior to a Boltzmann distribution. The latter is a notion that belongs more to the theory of dynamical systems or potentially catastrophe theory. The link between these two notions is, as you say, not obvious.
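To spell out the correspondence being referred to (a standard identity, not specific to Chen et al): writing L_n(w) for the empirical negative log likelihood and φ for the prior, the posterior is a Boltzmann distribution in which the sample size n plays the role of inverse temperature,

```latex
p(w \mid D_n) \;=\; \frac{\varphi(w)\, e^{-n L_n(w)}}{Z_n},
\qquad
Z_n = \int \varphi(w)\, e^{-n L_n(w)}\, \mathrm{d}w,
```

so a sudden shift in where the posterior concentrates as n increases is a first-order transition in exactly the statistical-physics sense.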
However Singular Learning Theory (SLT) does provide a link, which we explore in Chen et al. SLT says that the phases of Bayesian learning are also dominated by critical points of the loss, and so you can ask whether a given dynamical phase transition α→β has “standing behind it” a Bayesian phase transition where at some critical sample size the posterior shifts from being concentrated near w∗α to being concentrated near w∗β.
It turns out that, at least for sufficiently large n, the only real obstruction to this Bayesian phase transition existing is that the local learning coefficient near w∗β should be higher than near w∗α. This will be hard to prove theoretically in non-toy systems, but we can estimate the local learning coefficients, compare them, and thereby provide evidence that a Bayesian phase transition exists.
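To see why this is the relevant condition, compare the SLT free energy asymptotics F_n ≈ nL + λ log n across the two phases. A small sketch with hypothetical numbers (the L and λ values below are made up for illustration, not estimates from any real system):

```python
import numpy as np

# SLT free energy asymptotics for a phase: F_n ≈ n * L + lam * log(n),
# where L is the loss at the critical point and lam is the local learning
# coefficient. The posterior concentrates in the phase with lower F_n.
# Hypothetical values: alpha has higher loss but lower lam, so it is
# preferred at small n; beta takes over once n is large enough.
L_alpha, lam_alpha = 0.10, 2.0
L_beta,  lam_beta  = 0.05, 5.0

def free_energy(n, L, lam):
    return n * L + lam * np.log(n)

ns = np.arange(2, 2001)
alpha_preferred = (free_energy(ns, L_alpha, lam_alpha)
                   < free_energy(ns, L_beta, lam_beta))

# Critical sample size: first n at which the posterior shifts to beta.
n_crit = ns[np.argmax(~alpha_preferred)]
```

Note that if λ near w∗β were not higher than near w∗α, the lower-loss phase β would have lower free energy at every n and there would be no transition at all, which is why the comparison of local learning coefficients is the thing to check.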
This has been done in the Toy Model of Superposition in Chen et al, and we’re in the process of looking at a range of larger systems including induction heads. We’re not ready to share those results yet, but I would point you to Nina Rimsky and Dmitry Vaintrob’s nice post on modular addition which I would say provides evidence for a Bayesian phase transition in that setting.
There are some caveats and details that I can go into if you’re interested. I would say the existence of Bayesian phase transitions in non-toy neural networks is not established yet, but at this point I think we can be reasonably confident they exist.
Thanks for the detailed response!
So, to check my understanding:
The toy cases discussed in Multi-Component Learning and S-Curves are clearly dynamical phase transitions. (It’s easy to establish dynamical phase transitions from observation alone in general, and in these cases we can verify the property holds for the corresponding differential equations; step size is unimportant, so the differential equations are a good model.) Also, I speculate it’s easy to prove the existence of a Bayesian phase transition in the number of samples for these toy cases, given how simple they are.
Yes I think that’s right. I haven’t closely read the post you link to (but it’s interesting and I’m glad to have it brought to my attention, thanks) but it seems related to the kind of dynamical transitions we talk briefly about in the Related Works section of Chen et al.