Thanks also for this post! I enjoy reading the sequence and look forward to post 5 on the connections to alignment :)
At some critical value $\theta = \theta_c$, we recognise a phase transition as being a discontinuous change in the free energy or one of its derivatives, for example the generalisation error $G_n = \mathbb{E}[F_{n+1}] - \mathbb{E}[F_n]$.
“Discontinuity” might suggest that this happens fast. Yet, e.g. in work on grokking, it actually turns out that these “sudden changes” happen over a majority of the training time (often, the x-axis is on a logarithmic scale). Is this compatible, or would this suggest that phenomena like grokking aren’t related to the phase transitions predicted by SLT?
There is, however, one fundamentally different kind of “phase transition” that we cannot explain easily with SLT: a phase transition of SGD in time, i.e. the number of gradient descent steps. The Bayesian framework of SLT does not really allow one to speak of time; the closest quantity is the number of datapoints $n$, but these are not equivalent. We leave this gap as one of the fundamental open questions of relating SLT to current deep learning practice.
As far as I know, modern transformers are often only trained once on each data sample, which should close the gap between SGD time and the number of data samples quite a bit. Do you agree with that perspective?
In general, it seems to me that we’re probably most interested in phase transitions that happen across SGD time or with more data samples, whereas phase transitions related to other hyperparameters (for example, varying the truth as in your examples here) are maybe less crucial. Would you agree with that?
Would you expect that most phase transitions in SGD time or the number of data samples are first-order transitions (as is the case when there is a loss-complexity tradeoff), or can you conceive of second-order phase transitions that might be relevant in that context as well?
Which altered the posterior geometry, but not that of $K(w)$, since $p(w \mid D_n) \approx e^{-nK(w)}$ (up to a normalisation factor).
I didn’t understand this footnote.
but the node-degeneracy and orientation-reversing symmetries only occur under precise configurations of the truth.
Hmm, I thought that these symmetries are about configurations of the parameter vector, irrespective of whether it is the “true” vector or not. Are you maybe trying to say the following? The truth determines which parameter vectors are preferred by the free energy, e.g. those close to the truth. For some truths, we will have more symmetries around the truth, and thus a lower RLCT for the regions preferred by the posterior.
We will use the label weight annihilation phase to refer to the configuration of nodes such that the weights all point into the centre region and annihilate one another.
It seems to me that in the other phase, the weights also annihilate each other, so the “non-weight annihilation phase” is a somewhat weird terminology. Or did I miss something?
The weight annihilation phase $E_{\mathrm{NonWA}}$ is never preferred by the posterior
I think there is a typo and you meant $E_{\mathrm{WA}}$.
“Discontinuity” might suggest that this happens fast. Yet, e.g. in work on grokking, it actually turns out that these “sudden changes” happen over a majority of the training time (often, the x-axis is on a logarithmic scale). Is this compatible, or would this suggest that phenomena like grokking aren’t related to the phase transitions predicted by SLT?
This is a great question and something that came up at the recent summit. We would definitely say that the model is in two different phases before and after grokking (i.e. when the test error is flat), but it’s an interesting question to consider what’s going on over that long period of time where the error is slowly decreasing. I imagine that it is a relatively large model (from an SLT point of view, which means not very large at all from a normal ML point of view), meaning there would be a plethora of different singularities in the loss landscape. My best guess is that it is undergoing many phase transitions across that entire period, where it is finding regions of lower and lower RLCT but equal accuracy. I expect there to be some work done in the next few months applying SLT to the grokking work.
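For what it’s worth, here is a rough sketch of the kind of measurement I have in mind: estimate a local learning coefficient at saved training checkpoints (here via a WBIC-style estimator with SGLD sampling from a localised, tempered posterior) and watch for it dropping while accuracy stays flat. The toy model, data and hyperparameters below are placeholders I’ve made up for illustration, not anything from the grokking papers or this sequence, and the estimator details would need more care in practice.

```python
# Sketch only: track a local learning coefficient estimate across training
# checkpoints, to look for "equal accuracy, decreasing complexity" transitions.
# The model, data and hyperparameters are placeholders.

import copy
import math
import torch
import torch.nn as nn

def avg_nll(model, X, Y):
    # Average negative log-likelihood L_n(w) under unit-variance Gaussian noise.
    return 0.5 * ((model(X) - Y) ** 2).mean()

def local_lambda_hat(model, X, Y, steps=2000, burn_in=500, eps=1e-5, gamma=100.0):
    # WBIC-style estimator: lambda_hat = n * beta * (E_beta[L_n(w)] - L_n(w*)),
    # with beta = 1/log(n) and the expectation taken over an SGLD chain
    # localised around the checkpoint parameters w*.
    n = X.shape[0]
    beta = 1.0 / math.log(n)
    w_star = [p.detach().clone() for p in model.parameters()]
    L_star = avg_nll(model, X, Y).item()

    sampler = copy.deepcopy(model)
    total, count = 0.0, 0
    for t in range(steps):
        loss = avg_nll(sampler, X, Y)
        sampler.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p, p0 in zip(sampler.parameters(), w_star):
                # Gradient of the localised, tempered posterior potential.
                drift = n * beta * p.grad + gamma * (p - p0)
                p.add_(-0.5 * eps * drift + math.sqrt(eps) * torch.randn_like(p))
        if t >= burn_in:
            total += loss.item()
            count += 1
    return n * beta * (total / count - L_star)

if __name__ == "__main__":
    # Toy usage: in a real experiment you would load each saved checkpoint of the
    # training run and plot lambda_hat alongside train/test loss.
    torch.manual_seed(0)
    X = torch.randn(1000, 2)
    Y = torch.tanh(X @ torch.randn(2, 1))
    model = nn.Sequential(nn.Linear(2, 4), nn.Tanh(), nn.Linear(4, 1))
    print(local_lambda_hat(model, X, Y))
```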
As far as I know, modern transformers are often only trained once on each data sample, which should close the gap between SGD time and the number of data samples quite a bit. Do you agree with that perspective?
This is a very interesting point. I broadly agree and think it is worth thinking more about; it could be a very useful simplifying assumption when considering the connection between SGD and SLT.
In general, it seems to me that we’re probably most interested in phase transitions that happen across SGD time or with more data samples, whereas phase transitions related to other hyperparameters (for example, varying the truth as in your examples here) are maybe less crucial. Would you agree with that?
Broadly speaking, yes. That said, hyperparameters of the model are probably interesting too (although maybe more from a capabilities standpoint). I think phase transitions in the truth are also probably interesting in the sense of dataset bias, i.e. what changes about a model’s behaviour when we include or exclude certain data? It’s worth noting that the Toy Models of Superposition work explicitly deals with phase transitions in the truth, so there’s definitely a lot of value to be had from studying how variations in the truth induce phase transitions, and what the ramifications are for other things we care about.
Would you expect that most phase transitions in SGD time or the number of data samples are first-order transitions (as is the case when there is a loss-complexity tradeoff), or can you conceive of second-order phase transitions that might be relevant in that context as well?
At a first pass, one might say that second-order phase transitions correspond to something like the formation of circuits. I think there are definitely reasons to believe both happen during training.
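To be a bit more concrete about the convention I have in mind here (the standard statistical-mechanics classification, consistent with the definition of a phase transition in the post), roughly:

$$\text{first order:}\quad \lim_{\theta\to\theta_c^-}\frac{\partial F}{\partial \theta}\neq\lim_{\theta\to\theta_c^+}\frac{\partial F}{\partial \theta},\qquad \text{second order:}\quad \frac{\partial F}{\partial \theta}\ \text{continuous at}\ \theta_c,\ \text{but}\ \frac{\partial^2 F}{\partial \theta^2}\ \text{is not.}$$

So a loss-complexity tradeoff where the dominant phase switches shows up as a jump in the first derivative, whereas something like gradual circuit formation might only show up as a kink, i.e. a discontinuity further down the chain of derivatives.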
Which altered the posterior geometry, but not that of $K(w)$, since $p(w \mid D_n) \approx e^{-nK(w)}$ (up to a normalisation factor).
I didn’t understand this footnote.
I just mean that $K(w)$ is not affected by $n$ (even though of course $K_n(w)$ or $L_n(w)$ is), but the posterior is still affected by $n$. So the phase transition merely concerns the posterior and not the loss landscape.
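To spell it out schematically (dropping constants and lower-order terms, in the spirit of the free energy asymptotics from earlier in the sequence):

$$p(w\mid D_n)=\frac{\varphi(w)\,e^{-nL_n(w)}}{Z_n},\qquad F_n(W_\alpha)=-\log\int_{W_\alpha}\varphi(w)\,e^{-nL_n(w)}\,dw\approx nL_n(w_\alpha)+\lambda_\alpha\log n,$$

where $w_\alpha$ minimises the loss within the phase $W_\alpha$. The loss landscape $K(w)$ and its singularities are fixed, but as $n$ grows the balance between the accuracy term and the $\lambda_\alpha\log n$ complexity term shifts, so the phase with the lowest free energy, i.e. the one dominating the posterior, can change.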
but the node-degeneracy and orientation-reversing symmetries only occur under precise configurations of the truth.
Hmm, I thought that these symmetries are about configurations of the parameter vector, irrespective of whether it is the “true” vector or not.
My use of the word “symmetry” here is probably a bit confusing and a hangover from my thesis. What I mean is that these two configurations are only in the set of true parameters in this setup when the truth is configured in a particular way. In other words, they are always local minima of $K(w)$, but not always global minima. (This is what PT1 shows when $1.26 \leq \theta < \frac{\pi}{2}$.) Thanks for pointing this out.
It seems to me that in the other phase, the weights also annihilate each other, so the “non-weight annihilation phase” is a somewhat weird terminology. Or did I miss something?
Huh, I’d never really thought of this, but I now agree it is slightly weird terminology in some sense. I probably should have called them the weight-cancellation and non-weight-cancellation phases, as I described in the reply to your DSLT3 comment. My bad. I think it’s a bit too late to change now, though.
I think there is a typo and you meant $E_{\mathrm{WA}}$.
Thanks! And thanks for reading all of the posts so thoroughly and helping clarify a few sloppy pieces of terminology and notation, I really appreciate it.