I’ve recently spent a decent amount of time reading about Singular Learning Theory. It took some effort to understand what it’s all about (and I’m still learning), so I thought I’d write a short overview aimed at my past self, in the hopes that it’d be useful to others.
There were a couple of very basic things that took me surprisingly long to understand, and which I’ve tried to clarify here.
First, phase changes. I wasn’t, and still am not, a physicist, so I bounced off the physics motivations. I do have a math background, however, and Bayesian learning does allow for striking phase changes. Hence the example in the post.[1]
(Note: one shouldn’t think that, because SGD is based on local updates, it has no sudden jumps. Yes, of course there are no big jumps with respect to the Euclidean metric, but we never cared about that metric anyway. What we care about is sudden changes in higher-level properties, and those do occur with SGD. This is again something that took me an embarrassingly long time to really grasp.)
Second, the learning coefficient. I had read about SLT for quite a while, and heard about the mysterious learning coefficient quite a few times, before it was explained to me that it is just a measure of volume![2] A lot of things clicked into place: yes, obviously this is relevant for model selection, that’s why people talk about it.
(The situation is less clear for SGD, though: it doesn’t help that your basin is large if it isn’t reachable by local updates. Shrug.)
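To make the “measure of volume” point concrete: near a minimum, the volume of parameters whose loss is below ε scales roughly like ε^λ (up to log factors), and more singular minima have smaller λ, i.e. “bigger” basins. Here is a minimal sketch of that in Python, using two toy two-parameter losses of my own choosing; nothing here is from the post.

```python
import numpy as np

rng = np.random.default_rng(0)

def volume_exponent(loss, n_samples=2_000_000, eps_grid=np.logspace(-4, -2, 5)):
    """Estimate lambda in  Vol{w : loss(w) < eps} ~ eps^lambda
    by uniform sampling on [-1, 1]^2 and a log-log fit."""
    w = rng.uniform(-1.0, 1.0, size=(n_samples, 2))
    values = loss(w)
    fractions = np.array([(values < eps).mean() for eps in eps_grid])
    slope, _ = np.polyfit(np.log(eps_grid), np.log(fractions), 1)
    return slope

# Regular quadratic minimum at the origin: lambda = d/2 = 1 for two parameters.
print(volume_exponent(lambda w: w[:, 0] ** 2 + w[:, 1] ** 2))   # comes out near 1

# Singular minimum (the loss vanishes on both axes): lambda = 1/2, so the set of
# near-optimal parameters is far larger. The fitted slope lands a bit below 1/2
# at these eps because of an extra log factor.
print(volume_exponent(lambda w: w[:, 0] ** 2 * w[:, 1] ** 2))   # comes out around 0.4
```

This exponent is roughly what replaces the d/2 of BIC in the Bayesian free energy, which is why it matters for model selection: the free energy of a basin grows like nL₀ + λ log n rather than the nL₀ + (d/2) log n you’d guess from counting parameters.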
As implied in the post, I consider myself still a novice in SLT. I don’t have great answers to Alice’s questions at the end, and not all the technical aspects are crystal clear to me (and I’m intentionally not going deep in this post). But perhaps this type of exposition is best done by novices before the curse of knowledge starts hitting.
Thanks to everyone who proofread this, and for their encouragement.
[1] In case you are wondering: the phase change I plotted is obtained via the Gaussian prior $e^{-(12w)^2}$ and the loss function $(w^2+10^{-2})\,((w-1)^4+10^{-8})$. (Note that the loss is deterministic in w, which is unrealistic.)
This example is kind of silly: I’m just making the best model at w≈1 have a very low prior, so it will only show its head after a lot of data. If you want non-silly examples, see the “Dynamical versus Bayesian Phase Transitions in a Toy Model of Superposition” or “The Developmental Landscape of In-Context Learning” papers.
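If you want to reproduce something like the plot, here’s a rough sketch of how I’d compute it, treating the loss as the per-sample negative log-likelihood, so that after n samples the unnormalized posterior density is prior(w)·exp(−n·L(w)). This is my reconstruction of the setup, not the exact code behind the figure; the grid and the values of n scanned below are arbitrary choices.

```python
import numpy as np

# Parameter grid; the range and resolution are arbitrary choices.
w = np.linspace(-0.5, 1.5, 20001)

log_prior = -(12 * w) ** 2                      # Gaussian prior exp(-(12w)^2), unnormalized
loss = (w ** 2 + 1e-2) * ((w - 1) ** 4 + 1e-8)  # the loss from this footnote

def mass_near_one(n):
    """Posterior probability of w > 1/2 after n observations,
    taking the loss as the per-sample negative log-likelihood."""
    log_post = log_prior - n * loss
    log_post -= log_post.max()                  # subtract the max to avoid underflow
    post = np.exp(log_post)
    return post[w > 0.5].sum() / post.sum()

for n in [0, 4_000, 8_000, 10_000, 11_000, 12_000, 16_000]:
    print(f"n = {n:>6}:  P(w > 1/2) = {mass_near_one(n):.3f}")
```

The posterior mass near w≈1 goes from essentially zero to essentially one over a narrow range of n: that sudden handover between the two basins is the phase transition, even though nothing in the setup changes discontinuously.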
[2] You can also define it via the poles of a certain zeta function, but I thought this route wouldn’t be very illuminating to Alice.
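For the curious (and at the risk of stating it slightly informally): the zeta function in question is

$$\zeta(z) = \int K(w)^z \, \varphi(w) \, dw,$$

where K is the population loss (the KL divergence to the truth) and φ is the prior. Its meromorphic continuation has poles at negative rational numbers; the learning coefficient λ is minus the largest pole, and the order m of that pole controls the log corrections. This fits the volume picture: Vol{w : K(w) < ε} ≈ c ε^λ (log(1/ε))^{m−1} as ε → 0.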