“Discontinuity” might suggest that this happens fast. Yet, e.g. in work on grokking, it actually turns out that these “sudden changes” happen over a majority of the training time (often, the x-axis is on a logarithmic scale). Is this compatible, or would this suggest that phenomena like grokking aren’t related to the phase transitions predicted by SLT?
This is a great question and something that come up at the recent summit. We would definitely say that the model is in two different phases before and after grokking (i.e. when the test error is flat), but it’s an interesting question to consider whats going on over that long period of time where the error is slowly decreasing. I imagine that it is a relatively large model (from an SLT point of view, which means not very large at all from normal ML pov), meaning there would be a plethora of different singularities in the loss landscape. My best guess is that it is undergoing many phase transitions across that entire period, where it is finding regions of lower and lower RLCT but equal accuracy. I expect there to be some work done in the next few months applying SLT to the grokking work.
As far as I know, modern transformers are often only trained once on each data sample, which should close the gap between SGD time and the number of data samples quite a bit. Do you agree with that perspective?
This is a very interesting point. I broadly agree with this and think it is worth thinking more about, and could be a very useful simplifying assumption in considering the connection between SGD and SLT.
In general, it seems to me that we’re probably most interested in phase transitions that happen across SGD time or with more data samples, whereas phase transitions related to other hyperparameters (for example, varying the truth as in your examples here) are maybe less crucial. Would you agree with that?
Broadly speaking, yes. With that said, hyperparameters in the model are probably interesting too (although maybe more from a capabilities standpoint). I think phase transitions in the truth are also probably interesting in the sense of dataset bias, i.e. what changes about a model’s behaviour when we include or exclude certain data? Worth noting here that the Toy Models of Superposition work explicitly deals in phase transitions in the truth, so there’s definitely a lot of value to be had from studying how variations in the truth induce phase transitions, and what these ramifications are in other things we care about.
Would you expect that most phase transitions in SGD time or the number of data samples are first-order transitions (as is the case when there is a loss-complexity tradeoff), or can you conceive of second-order phase transitions that might be relevant in that context as well?
At a first pass, one might say that second-order phase transitions correspond to something like the formation of circuits. I think there are definitely reasons to believe both happen during training.
Which altered the posterior geometry, but not that of K(w) since p(w|Dn)≈e−nK(w) (up to a normalisation factor).
I didn’t understand this footnote.
I just mean that K(w) is not affected by n (even though of course Kn(w) or Ln(w) is), but the posterior is still affected by n. So the phase transition merely concerns the posterior and not the loss landscape.
but the node-degeneracy and orientation-reversing symmetries only occur under precise configurations of the truth.
Hhm, I thought that these symmetries are about configurations of the parameter vector, irrespective of whether it is the “true” vector or not.
My use of the word “symmetry” here is probably a bit confusing and a hangover from my thesis. What I mean is, these two configurations are only in the set of true parameters in this setup when the truth is configured in a particular way. In other words, they are always local minima of K(w), but not always global minima. (This is what PT1 shows when 1.26≤θ<π2). Thanks for pointing this out.
It seems to me that in the other phase, the weights also annihilate each other, so the “non-weight annihilation phase” is a somewhat weird terminology. Or did I miss something?
Huh, I’d never really thought of this, but I now agree it is slightly weird terminology in some sense. I probably should have called them the weight-cancellation and non-weight-cancellation phases as I described in the reply to your DSLT3 comment. My bad. I think its a bit too late to change now, though.
I think there is a typo and you meant EWA.
Thanks! And thanks for reading all of the posts so thoroughly and helping clarify a few sloppy pieces of terminology and notation, I really appreciate it.
This is a great question and something that come up at the recent summit. We would definitely say that the model is in two different phases before and after grokking (i.e. when the test error is flat), but it’s an interesting question to consider whats going on over that long period of time where the error is slowly decreasing. I imagine that it is a relatively large model (from an SLT point of view, which means not very large at all from normal ML pov), meaning there would be a plethora of different singularities in the loss landscape. My best guess is that it is undergoing many phase transitions across that entire period, where it is finding regions of lower and lower RLCT but equal accuracy. I expect there to be some work done in the next few months applying SLT to the grokking work.
This is a very interesting point. I broadly agree with this and think it is worth thinking more about, and could be a very useful simplifying assumption in considering the connection between SGD and SLT.
Broadly speaking, yes. With that said, hyperparameters in the model are probably interesting too (although maybe more from a capabilities standpoint). I think phase transitions in the truth are also probably interesting in the sense of dataset bias, i.e. what changes about a model’s behaviour when we include or exclude certain data? Worth noting here that the Toy Models of Superposition work explicitly deals in phase transitions in the truth, so there’s definitely a lot of value to be had from studying how variations in the truth induce phase transitions, and what these ramifications are in other things we care about.
At a first pass, one might say that second-order phase transitions correspond to something like the formation of circuits. I think there are definitely reasons to believe both happen during training.
I just mean that K(w) is not affected by n (even though of course Kn(w) or Ln(w) is), but the posterior is still affected by n. So the phase transition merely concerns the posterior and not the loss landscape.
My use of the word “symmetry” here is probably a bit confusing and a hangover from my thesis. What I mean is, these two configurations are only in the set of true parameters in this setup when the truth is configured in a particular way. In other words, they are always local minima of K(w), but not always global minima. (This is what PT1 shows when 1.26≤θ<π2). Thanks for pointing this out.
Huh, I’d never really thought of this, but I now agree it is slightly weird terminology in some sense. I probably should have called them the weight-cancellation and non-weight-cancellation phases as I described in the reply to your DSLT3 comment. My bad. I think its a bit too late to change now, though.
Thanks! And thanks for reading all of the posts so thoroughly and helping clarify a few sloppy pieces of terminology and notation, I really appreciate it.