What’s going on with deep learning? What sorts of models get learned, and what are the learning dynamics? Singular learning theory is a theory of Bayesian statistics broad enough in scope to encompass deep neural networks that may help answer these questions. In this episode, I speak with Daniel Murfet about this research program and what it tells us.
In this transcript, to improve readability, first names are omitted from speaker tags.
Filan: Hello, everybody. In this episode, I’ll be speaking with Daniel Murfet, a researcher at the University of Melbourne studying singular learning theory. For links to what we’re discussing, you can check the description of this episode and you can read the transcripts at axrp.net. All right, well, welcome to AXRP.
Murfet: Yeah, thanks a lot.
What is singular learning theory?
Filan: Cool. So I guess we’re going to be talking about singular learning theory a lot during this podcast. So, what is singular learning theory?
Murfet: Singular learning theory is a subject in mathematics. You could think of it as a mathematical theory of Bayesian statistics that’s sufficiently general with sufficiently weak hypotheses to actually say non-trivial things about neural networks, which has been a problem for some approaches that you might call classical statistical learning theory. This is a subject that’s been developed by a Japanese mathematician, Sumio Watanabe, and his students and collaborators over the last 20 years. And we have been looking at it for three or four years now and trying to see what it can say about deep learning in the first instance and, more recently, alignment.
Filan: Sure. So what’s the difference between singular learning theory and classical statistical learning theory that makes it more relevant to deep learning?
Murfet: The “singular” in singular learning theory refers to a certain property of the class of models. In statistical learning theory, you typically have several mathematical objects involved. One would be a space of parameters, and then for each parameter you have a probability distribution, the model, over some other space, and you have a true distribution, which you’re attempting to model with that pair of parameters and models.
And in regular statistical learning theory, you have some important hypotheses. Those hypotheses are, firstly, that the map from parameters to models is injective, and secondly (quite similar, but a little bit distinct technically) that if you vary the parameter infinitesimally, the probability distribution it parameterizes also changes. This is technically the non-degeneracy of the Fisher information metric. But together these two conditions basically say that changing the parameter changes the distribution, i.e. changes the model.
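[In symbols, for readers who want the standard definition (notation added here, not from the episode): the Fisher information matrix at a parameter $w$ is]

$$
I(w)_{ij} \;=\; \mathbb{E}_{x \sim p(x \mid w)}\!\left[\frac{\partial \log p(x \mid w)}{\partial w_i}\,\frac{\partial \log p(x \mid w)}{\partial w_j}\right].
$$

[A model class is "regular" when the map $w \mapsto p(\cdot \mid w)$ is injective and $I(w)$ is positive definite everywhere; it is "singular" when either condition fails.]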
And so those two conditions together are in many of the major theorems that you’ll see when you learn statistics, things like the Cramér-Rao bound, many other things; asymptotic normality, which describes the fact that as you take more samples, your model tends to concentrate in a way that looks like a Gaussian distribution around the most likely parameter. So these are sort of basic ingredients in understanding how learning works in these kinds of parameterized models. But those hypotheses do not hold, it’s quite easy to see, for neural networks. I can go into more about why that is.
So the theorems just don’t hold. Now, you can attempt to make use of some of these ideas anyway, but if you want a thoroughgoing, deep theory that is Bayesian and describes the Bayesian learning process for neural networks, then you have to be proving theorems in the generality that singular learning theory is. So the “singular” refers to the breaking of these hypotheses. So the fact that the map from parameters to models is not injective, that means, in combination with this other statement about the Fisher information metric, that if you start at a neural network parameter, then there will always be directions you can vary that parameter without changing the input/output behavior, without changing the model.
Some of those directions are kind of boring, some of them are interesting, but that’s what singular learning theory is about: accommodating that phenomenon within the space of neural networks.
Filan: The way I’d understood it is that this basically comes down to symmetries in the neural network landscape. You can maybe scale down this neuron and scale up this neuron, and if neurons are the same, it doesn’t matter. But not only are there symmetries, there are non-generic symmetries.
Murfet: Correct. Yeah.
Filan: Because if there were just some symmetries, then maybe you could mod out by the symmetries… If you looked at the normal direction to the space at which you could vary things, then maybe that would be fine. So the way I’ve understood it is that there are certain parameter settings for neural networks where you can change it one way or you can change it another way, but you can’t change it in both directions at once. And there are other parameter settings where you can only change it in one of those two ways. So the fact that you can’t do them both at once means it’s not a nice, smooth manifold. And the fact that it’s different at different places means that it’s not this generic thing over the whole space. Some models are more symmetric than others and that ends up mattering.
Murfet: Yeah, I would say that’s mostly correct. I would say the word ‘symmetry’ is really not… I think I would also at a high level maybe use this word to a first approximation in explaining what’s going on, but it’s really not a sufficient concept. But yeah, it’s good to distinguish the kind of boring generic symmetries that come from the non-linearities. So in some sense, that’s why you can just look at a neural network and know that it’s singular because of these symmetries: like with the ReLU, scaling up the input weights and scaling down the output weights respectively will not change the behavior of the network. So that’s an obvious scaling symmetry, and that means that it’s degenerate and therefore a singular model.
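[As a quick worked version of that scaling symmetry: for a single ReLU unit and any $\alpha > 0$,]

$$
w_{\text{out}}\,\operatorname{ReLU}(w_{\text{in}} \cdot x) \;=\; \frac{w_{\text{out}}}{\alpha}\,\operatorname{ReLU}\big(\alpha\, w_{\text{in}} \cdot x\big),
$$

[so there is a whole one-parameter family of weights implementing exactly the same input/output map.]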
But if that was all there was, then I agree: somehow that’s a boring technical thing that doesn’t seem like you really need, from a point of view of understanding the real phenomena, to care about it too much. But the reason that SLT [singular learning theory] is interesting is that, as you say, different regions of parameter space, you could say have different kinds of symmetries as a reflection of the different ways qualitatively in which they’re attempting to model the true distribution. But this other thing you mentioned, about being able to move in different directions, that’s not really symmetry so much as degeneracy.
So we could go more into conceptually why different regions or different kinds of solutions might have different kinds of degeneracy, but at a high level that’s right. Different kinds of solutions have different kinds of degeneracy, and so being able to talk about different kinds of degeneracy and how they trade off against one another, and why Bayesian learning might prefer more or less degenerate kinds of models, is the heart of SLT.
Filan: Sure. Before we go into that, what do you mean by “degeneracy”?
Murfet: Degeneracy just refers to this failure of the map from parameters to models to be injective. So “a degeneracy” would just mean a particular kind of way in which you could vary the neural network parameter, say in such a way that the input/output map doesn’t change. And as you were just mentioning, you might have, at one point, two or more essentially different ways in which you could vary the parameter without changing the loss function. And that is by definition what geometry is. So what I’m describing there with my hand is the level set of the loss function. It might be the minimal level set or some other level set, but if we’re talking about multiple ways I can change the neural network parameter without changing the loss, then I’m describing the configuration of different pieces of the level set of the loss function at that point. And that’s what geometry is about.
Filan: Sure. You mentioned that singular learning theory, or SLT for short, is very interested in different kinds of degeneracies. Can you tell us a little bit [about] what are the kinds of degeneracies, what different kinds of degeneracies might we see maybe in deep learning? And why does the difference matter?
Murfet: I think it’s easier to start with a case that isn’t deep learning, if that’s all right. Deep learning jumps straight into the deep end in terms of… and it’s also the thing which we understand least, perhaps. But if you imagine the easiest kind of loss functions… and when I say loss function, I typically mean “population loss”, not the empirical loss from a fixed dataset of finite size, but the average of that over all datasets. So that’s somehow the theoretical object whose geometry matters here, so I’ll flag that, and there are some interesting subtleties there.
So in a typical case, in a regular statistical setting—not neural networks, but linear regression or something—the population loss looks like a sum of squares, so just a quadratic form. And there, minimizing it—I mean maybe with some coefficients, the level sets are ellipses—then the learning process just looks like moving down that potential well to the global minimum. And that’s kind of all that’s happening. So in that case, there’s no degeneracy. So there’s just one global minimum and you can’t vary it at all and still have zero loss.
A more interesting case would be where: suppose you have 10 variables, but a sum of eight squares, so x1² through x8². And then if you minimize that, well, you’ve still got two free parameters, so there’s a two-dimensional space of global minima of that function. Now imagine a population loss (let’s only care about local minima) which has many local minima at various heights of the loss, each of which uses a different number of variables. So we suppose, for instance, that the global minimum maybe uses all 10, but then there’s a level set a bit higher than that that uses only nine squares, and a level set a bit higher than that that uses only eight squares. And so then those have different amounts of degeneracy.
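[In symbols, the example is a population loss on $\mathbb{R}^{10}$ of the form]

$$
L(x_1, \dots, x_{10}) \;=\; x_1^2 + \cdots + x_8^2,
$$

[whose minimum level set $\{x_1 = \cdots = x_8 = 0\}$ is a two-dimensional plane: $x_9$ and $x_{10}$ are free directions, so the minimum is degenerate. By contrast, $x_1^2 + \cdots + x_{10}^2$ has a single isolated minimum at the origin.]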
So you have different points in the parameter space, loss landscape, where local minima have different degrees of degeneracy. And so then you can think about the competition between them in terms of trading off between preference for degeneracy versus preference for loss. And then we’re getting into key questions of if you’re a Bayesian, what kind of solution you prefer in terms of accuracy versus degeneracy.
Filan: And I guess this gets to this object that people talk about in singular learning theory called the “learning coefficient”. Can you tell us a little bit about what the learning coefficient is?
Murfet: In the case I was just describing, it’s easy to say what the learning coefficient is. There’s a distinction between global learning coefficient… Everything I say about SLT, more or less, is material that was introduced by Watanabe and written about in his books, and at some point, I guess we’ll talk about our contributions more recently. But mostly what I’m describing is not my own work, just to be clear.
So I’ll mostly talk about the local learning coefficient, which is a measure of degeneracy near a point in parameter space. If I take this example I was just sketching out: you imagine the global minimum level set and then some higher level sets. And I said that the population loss near the global minimum looked like a sum of 10 squares. And so the local learning coefficient of that would just be 10⁄2, so a half times the number of squares that you used.
So if there was a level set that had used only eight squares, then that’s degenerate, because you have two free directions, so it’s not a single isolated minimum, but rather a sort of two-dimensional plane of minima. And each point of that two-dimensional plane would, because it locally looks like a sum of eight squares, have 8⁄2 as its local learning coefficient and so on. So if you use D’ squares in the local expression of your population loss, then your local learning coefficient is D’/2. That’s not how it’s defined: it has a definition, which we could get into various different ways of looking at it, but that’s what it cashes out to in those examples.
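[In symbols: if near a point $w^*$ the population loss can be written, in suitable local coordinates, as]

$$
L(w) \;\approx\; L(w^*) + u_1^2 + \cdots + u_{d'}^2,
$$

[then the local learning coefficient there is $\lambda(w^*) = d'/2$. Fewer squares means more free directions and a lower $\lambda$, i.e. a more degenerate point; this is what the general definition cashes out to in these simple cases.]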
Filan: Sure. I guess the way to think about this local learning coefficient is that when it’s lower, that’s a solution that’s more degenerate. And the way I gather Bayesian inference works is that it tries to have both a low loss and also a low local learning coefficient. Does that sound right?
Murfet: Yep.
Filan: An image I often see in discussions of singular learning theory is people drawing doodles of trefoils and figure eights and maybe a circle to throw in there. The thing I often hear (as a caricature) is: initially you stay around the trefoil for a while, this is where you put your posterior mass, until at some point you get enough data and then you start preferring this figure eight, and then you get even more data and then you start preferring this circle, which has maybe even lower loss. So as you go down, maybe you get better loss, let’s just say, but the local learning coefficient is going to increase and therefore get worse.
Murfet: Maybe I’ll caveat that a little: the local learning coefficient is increasing, so you’re accepting a more complex solution in exchange for it being more accurate.
Phase transitions
Filan: Yeah. So that’s the very basic idea of singular learning theory. Why does it matter? What are the important differences between the singular learning theory picture and the classical statistical learning theory picture?
Murfet: In what context? Statistical learning theory in general, deep learning theory, alignment, or all three in that order?
Filan: Maybe all three in that order. I think I want to put off the discussion of alignment relevance for a little bit later until we just understand what’s going on with this whole thing.
Murfet: Okay. Yeah, I guess I didn’t actually come back to your question about the local learning coefficient in neural networks from earlier, but I think the cartoon in terms of sums of squares might still suffice for the moment.
If we talk about statistical learning theory in machine learning or deep learning in general, I think the main high-level conceptual takeaway from singular learning theory when you first encounter it should be that the learning process in Bayesian statistics really is very different for singular models. So let me define what I mean by “learning process”.
When we say “learning process” in deep learning, we tend to mean training by stochastic gradient descent. And what I’m saying is maybe related to that, but that’s a tricky point, so let me be clear that in Bayesian statistics, the “learning process” refers to: as you see more data, you change your opinion about what the relative likelihood of different parameters is. So you see more data, some parameters become ruled out by that data because they don’t give that data high probability, whereas other parameters become more likely. And what I’m describing is the Bayesian posterior, which assigns a probability to each parameter according to the data.
And so as you see more samples… I mean, if you’ve seen very few samples, you really have no idea which parameters are correct, so the posterior is very diffuse and will change a lot as you see more samples because you just are very ignorant. But asymptotic normality and regular statistical learning theory says that as you see more samples, that process starts to become more regular and concentrate around the true parameter in a way that looks like a Gaussian distribution.
So that’s in some sense a very simple process. But in singular models, that is not what happens, at least that’s not what’s predicted to happen by the theory. Until relatively recently, I think we didn’t have many very compelling examples of this in practice. But what the theory says is what you were describing earlier, that the Bayesian posterior should kind of jump as the trade-off between accuracy and complexity changes, which is a function of the number of samples. And those jumps move you from regions of qualitatively different solutions to other kinds of solutions, and then eventually maybe asymptotically to even choosing among perfect solutions depending on their complexity and then so on.
So there’s a very complicated, not very well-understood process underlying learning in Bayesian statistics for singular models, which as far as I know, Watanabe and his collaborators are the only people to ever really study. This is despite being somewhat old, in the sense that Watanabe and students and collaborators have been working on it for a while; it’s really not been studied in great depth outside of their group.
So [it’s] a very fundamental process in Bayesian statistics, relatively understudied, but arguably, at least if you take a Bayesian perspective, very central to how learning works in (say) neural networks, whether they’re artificial ones or even possibly biological ones.
So I think that’s the main thing. I mean, that’s not the only thing singular learning theory talks about. It’s not the only theoretical content, but I would say that’s the main thing I would want someone to know about the theory as it stands right now. The other thing is how that relates to generalization, but maybe I’ll pause there.
Filan: Sure. Maybe we should talk about that a bit. I hear people talk about this with the language of phase transitions. And I think upon hearing this, people might say, “Okay, if you look at loss curves of big neural nets that are being trained on language model data, the loss kind of goes down over time, and it doesn’t appear to be stuck at one level and then suddenly jump down to another level and then be flat and then suddenly jump down.” We have things which kind of look like that in toy settings, like grokking, like the development of induction heads, but it doesn’t generically happen. So should we think of these phase transitions as being relevant to actual deep learning, or are they just a theoretical curiosity about the Bayesian theory?
Murfet: Yeah, I think that’s a very reasonable question. I think a year ago, we ourselves were skeptical on this front. I think even in toy settings it wasn’t very clear that this theoretical prediction bears out. So maybe I’ll spend a moment to just be quite precise about the relationship between theory and practice in this particular place.
What the theory says is: asymptotically in N, the number of samples, a certain formula describing the posterior works, and then based on this formula, you can have the expectation that phase transitions happen. But in principle, you don’t know lower-order terms in the asymptotic, and there could be all sorts of shenanigans going on that mean that this phenomenon doesn’t actually occur in real systems, even toy ones. So theory on its own—I mean in physics or in machine learning or whatever—has its limits, because you can’t understand every ingredient in an asymptotic expansion. So even in toy settings, it was reasonable, I think, to have some skepticism about how common this phenomenon was or how important it was, even if the theory is quite beautiful.
Okay, so that aside, you go and you look in toy systems and you see this behavior, as we did, and then I think it’s reasonable to ask, “Well, okay, so maybe this happens in small systems, but not in large systems?” And indeed in learning curves, we don’t think we see a lot of structure.
So I’ll tell you what we know, and then what I think is going on. I should preface this by saying that actually we don’t know the answer to this question. So I think that it still remains unclear if this prediction about phases and phase transitions is actually relevant to very large models. We’re not certain about that. I would say there’s a reasonable case for thinking it is the case that it is relevant, but I want to be clear about what we know and don’t know.
Again, this is kind of an empirical question, because the theoretical situation under which phases and phase transitions exist… the theory stops at some point and doesn’t say much at the moment about this scale or that scale.
So what we know is that if you look at transformers around the scale of three million parameters, trained on language model datasets, you do see something like phases and phase transitions. So again, what I’m about to describe is the learning process of training rather than seeing more samples. The theoretical jump that we’re making here is to say: okay, if the theory says there should be qualitative changes in the way the posterior describes which models are probable over the course of the Bayesian learning process, as you see more samples, then you might expect something similar when you look at seeing cumulatively more examples through the training process of stochastic gradient descent. But that is not a theoretically justified step at this point in some rigorous sense. That’s the kind of prediction you might make assuming some similarity between the learning processes, and then you can go in empirically and see if it’s true.
So if you go and look at language models at the scale of three million parameters… This is a recent paper that we did, Developmental Landscape of In-Context Learning. If you go and look at that, what you see [is] that the training process is divided into four or five stages, which have different qualitative content in a way that isn’t visible in the loss curve mostly.
Filan: It is a little bit visible.
Murfet: Yeah, I would agree with that. I mean, to the same extent that the induction bump is sort of visible in the original in-context learning and induction heads paper.
Filan: Yeah. I mean, it’s not obvious from the loss curve. It’s not like everybody already knew all the things that you guys found out.
Murfet: Yeah, I would say that without these other results, if you looked at the loss curve and tried to tell the story about these little bumps, it would feel like tea leaf reading. But once you know that the stages are there, yes, you can look at the loss curve and sort of believe in certain features of them.
So I mean, there’s various details about how you think about the relationship between those stages and phases and phase transitions in a sense of SLT. But I would say that’s still a very small model, but not a toy model, in which you do see something like stage-wise development.
And there are independent reasons… People have independently been talking about stage-wise development in learning systems outside of SLT. So I would say that the SLT story and stage-wise development as a general framing for how structure arrives inside self-organizing learning processes, that dovetails pretty well. So I would say that, to come back to your question about structure in the loss curve, just because nothing’s happening in the loss curve doesn’t mean that there isn’t structure arriving in stages within a model. And our preliminary results on GPT-2 Small at 160 million parameters: at a high level it has stages that look pretty similar to the ones in the three-million-parameter model.
Filan: Interesting.
Murfet: So here’s my guess for what’s going on. It’s true that in very large models, the system is learning many things simultaneously, so you won’t see very sharp transitions except possibly if they’re very global things: [e.g.] switching to in-context learning as a mode of learning seems like it affects most of the things that a system is learning, so a qualitative change at that scale, maybe you would guess actually is represented sort of at the highest level and might even be visible in the loss curve, in the sense that everything is coordinated around that. There’s before and after.
But for many other structures you might learn, while they’re developing, somewhere else in the model it’s memorizing the names of U.S. presidents or something, which just has nothing to do with structure X, Y, Z. And so in some sense, the loss curve can’t possibly hit a plateau, because even if it’s hitting a critical point for these other structures X, Y, Z, it’s steadily making progress memorizing the U.S. presidents. So there can’t be clear plateaus.
So the hypothesis has to be something like: if there is stage-wise development, which is reflected by these phases and phase transitions, it’s in some sense or another localized, maybe localized to subsets of the weights and maybe localized in some sense to certain parts of the data distribution. So the global phases or phase changes which touch every part of the model and affect every kind of input are probably relatively rare, but that isn’t the only kind of phase, phase transition, stage to which Bayesian statistics or SLT could apply.
Filan: Sure. Should I imagine these as being sort of singularities in a subspace of the model parameter space? The learning coefficient kind of picks them out in this subspace, but maybe not in the whole parameter space?
Murfet: Yeah, that’s kind of what we’re thinking. These questions are pushing into areas that we don’t understand, I would say. So I can speculate, but I want to be clear that some parts of this we’re rather certain of: the mathematical theory is very solid, the observation of the correspondence between the theory and Bayesian phase transitions in toy models is empirically and theoretically quite solid. This question of what’s happening in very large systems is a deep and difficult question. I mean, these are hard questions, but I think that’s right, that’s the motivation for… One of the things we’re currently doing is what we call weight-restricted local learning coefficients. This basically means you take one part of the model, say, a particular head, you freeze all the other weights…
Let me just give a more formal setting. When we’re talking about the posterior and the local learning coefficient and so on, we imagine a space of parameters. So there’s D dimensions or something. Some of those directions in parameter space belong to a particular head, and I want to take a parameter that, at some point in training, has some values for all these heads, I mean, for all these different weights, and I want to freeze all but the ones in the head and then treat that as a new model. Now, my model is I’m not allowed to change those weights, but I’m allowed to change the weights involved in the head, and I can think about the Bayesian posterior for that model and I can talk about its local learning coefficient.
That involves perturbing the parameter near that particular point, but in a way where you only perturb the weights involved in that part of the structure, say, that head, and you can define the complexity, the local learning coefficient, of that restricted model. That’s what we call the weight-restricted local learning coefficient. And then the hypothesis would be that, if a particular part of the model is specializing in particular kinds of structure and that structure is developing, then you’ll be at a critical point for some kind of restricted loss that is referring only to those weights, and that would show up.
We haven’t talked about how the local learning coefficient is used to talk about phase transitions, but that’s the experimental way in which you’d attempt to probe whether some part of the model is doing something interesting, undergoing a phase transition separately from other parts of the model.
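[A rough sketch of that setup in code, added here as an illustration (the function and the component prefix are hypothetical, not the authors’ implementation): freeze every weight except those belonging to one component, then run your LLC estimator, such as the SGLD-based one sketched later in this transcript, over the unfrozen weights only.]

```python
import torch

def restrict_to_component(model: torch.nn.Module, component_prefix: str):
    """Freeze all parameters except those whose name starts with component_prefix.

    The returned parameters are the only ones an LLC estimator should perturb,
    giving a weight-restricted local learning coefficient for that component.
    """
    restricted = []
    for name, p in model.named_parameters():
        p.requires_grad = name.startswith(component_prefix)
        if p.requires_grad:
            restricted.append(p)
    return restricted

# Hypothetical usage: the prefix depends entirely on how the model names its
# parameters, e.g. something like "blocks.1.attn" for one attention block.
```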
Filan: Yeah, actually, maybe we should clarify that. How do you use the learning coefficient to figure out if a phase transition is happening?
Murfet: It depends on your background which answer to this question is most pleasant. For physics-y people who know about free energy, they’re familiar with the idea that various derivatives of the free energy should do something discontinuous at a phase transition, and you can think about the local learning coefficient as being something like that. So that, if there is a phase transition, then you might expect this number to change rapidly relative to the way it usually changes.
If we just stick within a statistical learning theory frame, we were laying out this picture earlier of: as you see more samples, the Bayesian posterior is concentrated in some region of parameter space and then rapidly shifts to be concentrated somewhere else, and the local learning coefficient is a statistic of samples from the Bayesian posterior, so if the Bayesian posterior shifts, then this number will also shift. The expectation would be that, if you measure this number, which it turns out you can do from many experiments, if you see that number change in some significant way, then it is perhaps evidence that some qualitative change in the posterior has occurred. That’s a way of detecting phase transitions which is, if you take this bridge from Bayesian statistics to statistical physics, pretty well justified I would say.
Estimating the local learning coefficient
Filan: Sure. A question about that: my understanding is that trying to actually measure the local learning coefficient involves taking a parameter setting and looking at a bunch of parameter settings nearby on all these dimensions that you could vary it, and measuring a bunch of properties, and this is the kind of thing that’s easy to do when you have a very low-dimensional parameter space corresponding to a small number of parameters. It seems like it’s going to be harder to do with a higher number of parameters in your neural networks. Just practically, how large a model can you efficiently measure local learning coefficient [for] at this time?
Murfet: Yeah. That’s a good question. I think it’s tricky. Maybe this will be a bit of an extended answer, but I think it’ll be better if I provide some context. When we first started looking at SLT, myself and my colleague here at the University of Melbourne, Susan Wei, and some other people… This was before… believe it or not, today there are 10x the number of people interested in SLT than there were back when we started thinking about it. It was an extremely niche subject, very deep and beautiful, but somewhat neglected.
Our question at that time was exactly this question. The theory says the local learning coefficient—the “real log canonical threshold” is another mathematical name for it—the theory says this is a very interesting invariant, but it’s very unclear if you can accurately estimate it in larger models. A lot of the theoretical development [involved using] one PhD student to compute the RLCT of one model theoretically, and you need some hardcore algebraic geometry to do that, et cetera, et cetera. The way the subject sat, it wasn’t clear that you could really be doing this at scale because it seems to depend on having very accurate samples from the posterior via Markov Chain Monte Carlo sampling or something.
I admit, I was actually extremely pessimistic when we first started looking at it that there really would be a future in which we’d be estimating RLCTs, or local learning coefficients, of a hundred million parameter models. So that’s where I started from. My colleague Susan and my PhD student Edmund Lau decided to try SGLD, stochastic gradient Langevin dynamics, which is an approximate Bayesian sampling procedure based on using gradients, to see how it worked. There’s a step in estimating the local learning coefficient where you need samples from the posterior. As you’re describing, this is famously difficult for large dimensional complex models.
However, there is a possible loophole, which is that… I mean, I don’t believe that anybody has a technique, nor probably ever will, for understanding or modeling very accurately the Bayesian posterior of very large-scale models like neural networks. I don’t think this is within scope, and I’m skeptical of anybody who pretends to have a method for doing that, hence why I was pessimistic about estimating the LLC [local learning coefficient] at scale because it’s an invariant of the Bayesian posterior which seems to have a lot of information about it and I believe it’s hard to acquire that information. The potential loophole is that maybe the local learning coefficient relies on relatively robust signals in the Bayesian posterior that are comparatively easy to extract compared to knowing all the structure.
That seems to be the world that we are in. To answer your question, Zach Furman and Edmund Lau just recently had a pre-print out where, using SGLD, it seems you can get relatively accurate estimates for the local learning coefficient for deep linear networks: products of matrices with no nonlinearities, at scales up to a hundred million parameters.
Filan: A hundred million with an M?
Murfet: With an M, yeah. One should caveat that in several ways, but yeah.
Filan: Okay, and am I right that this is distinct from the “Quantifying degeneracy with the local learning coefficient” paper?
Murfet: That’s right. This is a second paper, a followup to that. I forget the title. I think it’s Estimating Local Learning Coefficient at Scale. So we wrote that paper a couple of years ago now, I think, looking at defining the local learning coefficient—which is implicit in Watanabe’s work, but we made it explicit—and making the observation that you could use approximate sampling to estimate it and then studying that in some simple settings, but it remained very unclear how accurate that was in larger models.
Now, the reason it’s difficult to go and test that is because we don’t know the true local learning coefficient for very many models that can be increased in some direction of scale. We know it for one hidden layer tanh networks and things like that. But some recent, very deep, interesting work by Professor Miki Aoyagi gives us the true value of the local learning coefficient for deep linear networks, which is why Zach and Edmund studied those. This was an opportunity to see if SGLD is garbage or not for this purpose.
I should flag that despite… How should I say this? SGLD is a very well-known technique for approximate Bayesian posterior sampling. I think everybody understands that you should be skeptical of how good those posterior samples are in some sense. It might be useful for some purpose, but you shouldn’t really view it as a universal solvent for your Bayesian posterior sampling needs or something. Just using SGLD doesn’t magically mean it’s going to work, so I would view it as quite surprising to me that it actually gives accurate estimates at scale for deep linear networks.
Now, having said that, deep linear networks are very special, and they are less degenerate in some important ways than real neural networks with nonlinearities, et cetera, so don’t take me as saying that we know that local learning coefficient estimation gives accurate values of the local learning coefficient for language models or something. We have basically no idea about that, but we know it’s accurate in deep linear networks.
Okay, so then what is generalizable about that observation? I think it leads us to believe that maybe, for estimating the LLC, SGLD is actually not garbage. How good it is we still don’t know, but maybe this cheap posterior sampling is still good enough to get you something interesting. And then the other thing is what you observe in cases where you know the true values: deep linear networks undergo phase transitions (maybe not in those exact terms, but stage-wise development in deep linear networks has been studied for quite a long time), and you can see that this local learning coefficient estimator, which is measuring the complexity of the current parameter during the learning process, does jump in the way you would expect when deep linear networks go through these phase transitions.
Well, it had to, because we know theoretically what’s happening to the geometry there. Those jumps in the local learning coefficient in other models, like these 3 million parameter language models or GPT-2 Small… when you go and estimate the local learning coefficient, you see it change in ways that are indicative of changes in internal structure. Now, we don’t know that the absolute values are correct when we do that, and most likely they’re not, but I think we believe in the changes in the local learning coefficient reflecting something real to a greater degree than we believe in the absolute values being real. Still, theoretically, I don’t know how we would ever get to a point where we would know the local learning coefficient estimation was accurate in larger models absent really fundamental theoretical improvements that I don’t see coming in the near term, but that’s where we are at the moment.
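[A minimal sketch of this kind of estimator, added here for concreteness and not taken from the papers discussed: SGLD sampling from a tempered posterior localized at the parameter of interest, with the LLC read off from the average increase in loss along the chain. The hyperparameters and the loss_fn/loader interface are illustrative assumptions.]

```python
import math
import torch

def estimate_llc(model, loss_fn, loader, n, steps=1000, eps=1e-5, gamma=100.0, beta=None):
    """Sketch of SGLD-based local learning coefficient estimation at the current parameters.

    loss_fn(model, batch) is assumed to return the mean loss on a minibatch;
    n is the training set size. Burn-in and careful averaging are omitted.
    """
    beta = beta if beta is not None else 1.0 / math.log(n)      # inverse temperature ~ 1/log n
    w_star = [p.detach().clone() for p in model.parameters()]   # the point whose LLC we want
    losses, data_iter = [], iter(loader)
    for _ in range(steps):
        try:
            batch = next(data_iter)
        except StopIteration:
            data_iter = iter(loader)
            batch = next(data_iter)
        loss = loss_fn(model, batch)
        model.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p, p0 in zip(model.parameters(), w_star):
                if p.grad is None:   # frozen weights (e.g. weight-restricted LLC) stay put
                    continue
                # SGLD step targeting exp(-n*beta*L_n(w) - gamma/2 * ||w - w*||^2)
                drift = n * beta * p.grad + gamma * (p - p0)
                p.add_(-0.5 * eps * drift + math.sqrt(eps) * torch.randn_like(p))
        losses.append(loss.item())
    # losses[0] is a crude minibatch stand-in for L_n(w*)
    return n * beta * (sum(losses) / len(losses) - losses[0])
```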
Singular learning theory and generalization
Filan: Fair enough. A while back, you mentioned the contributions of singular learning theory to understanding deep learning. There was something to do with phase transitions and there was also something to do with generalization, I think you mentioned. I want to ask you about that. Especially in the context of: I sometimes hear people say, “Oh, statistical learning theory says that model classes can have these parameters that have some degeneracy and that basically reduces their effective parameter count, and this just explains how generalization is possible.” This is the kind of story one can tell when one feels excitable, but it’s a bit more complicated. It’s going to depend on details of how these parameters actually translate into functions and what these degeneracies actually look like in terms of predictive models. What does singular learning theory tell us about generalization, particularly in the context of deep networks?
Murfet: Yeah. This is subtle. On its face, [in] singular learning theory, the theorems describe relations between loss, local landscape geometry, this local learning coefficient, and generalization error in the Bayesian sense. In the Bayesian sense, what I mean by generalization error is the KL divergence between the true distribution and the predictive distribution.
Maybe I should say briefly what the latter is. If you’re trying to make a prediction, if you’re talking about a conditional distribution, a prediction of Y given X, and you look at all the parameters that you’ve got for modeling that relationship, and you’re given an input and you take the prediction from every single model parameterized by your parameter space, you weight it with the probability given to that particular model by the Bayesian posterior and you average them all in that way, that’s the Bayesian predictive distribution. [It’s] obviously radically intractable to use that object or find that object. It’s a theoretical object. That probability distribution is probably not one that’s parameterized by parameters in your parameter space, but you can cook it up out of models in your parameter space. The KL divergence between that and the truth is the Bayesian generalization error.
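[In symbols (notation added here): writing $p(w \mid D_n)$ for the posterior after $n$ samples and $q$ for the true distribution,]

$$
p(y \mid x, D_n) \;=\; \int p(y \mid x, w)\, p(w \mid D_n)\, dw,
\qquad
G_n \;=\; \mathbb{E}_{x \sim q}\,\mathrm{KL}\big(q(y \mid x)\,\big\|\,p(y \mid x, D_n)\big),
$$

[where the first object is the Bayesian predictive distribution and $G_n$ is the Bayesian generalization error.]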
Filan: The KL divergence just being a measure of how different probability distributions are.
Murfet: Right. That seems like a very theoretical object. There’s a closely related object, the Gibbs generalization error, which puts some expectations in different orders, which is closer to what people in machine learning mean by “test error”—taking a parameter and trying it out on some samples from the true distribution that weren’t used to produce that parameter. There are various subtleties there. SLT, strictly speaking, only says things about those kinds of generalization errors. And the relationship between that and test error for a parameter produced by a single run of SGD—well, I don’t even know that that is a mathematical object actually (test error for a parameter after a single run), but you can do things like talk about, for some distribution of SGD runs, what’s the expected test error.
Then there’s a gap between that Bayesian story and what you mean by “test error” in deep learning. This gap hasn’t been very systematically addressed, but I’ll lay out some story about how you might bridge that eventually in order to answer your question. If you believe that the Bayesian learning process ends with a distribution of parameters that look something like the endpoints of SGD training, or at least close enough, that something like this average of SGD runs of the test error looks a bit like averaging over things in the Bayesian posterior of some generalization quantity that makes sense in the Bayesian theory, then you could maybe draw some connection between these two things.
That hasn’t been done. I don’t know if that’s true, because these questions about relations between the Bayesian posterior and SGD are very tricky and I don’t think they look like they’re going to get solved soon, at least in my opinion. There’s a gap there. That’s one gap. We just paper over that gap and just say, “Okay. Well, fine, let’s accept that for the moment and just treat the generalization error that SLT says things about as being the kind of generalization error that we care about. What does SLT say?”
Maybe I’ll insert one more comment about that relationship between test error in deep learning and Bayesian generalization error first. This is a bit of a tangent, but I think it’s important to insert here. Various people, when looking to explain the inductive bias of stochastic gradient descent, have hit upon a phenomenon that happens in deep linear networks and similar systems, which is a stage-wise learning where the model moves through complexity in an increasing way.
If we think about deep linear networks—or what’s sometimes called matrix factorization, where you’re trying to use a product of matrices to model a single linear transformation—people have observed that, if you start with a small initialization, the model starts with low rank approximations to the true linear transformation and then finds a pretty good low rank approximation and then takes a step to try and use linear transformations of one higher rank and so on, and moves through the ranks in order to try and discover a good model. Now, if you believe that, then you would believe that, if SGD training is doing that, then it will tend to find the simplest solution that explains the data, because it’s searching them starting with simpler ones and only going to more complicated ones when it needs to.
Now, theoretically, that’s only known to happen… I mean, I think it’s not known to happen in deep linear networks rigorously speaking, but there’s expectations of that, [and] empirically, that happens, and there’s some partial theory. Then it’s a big leap to believe that holds for general SGD training of general neural networks, so I think we really don’t know that that’s the case in general deep learning. Believing that is pretty similar to believing something about the Bayesian learning process moving through regions of parameter space in order of increasing complexity as measured by the local learning coefficient. In fact, that is exactly what’s happening in the deep linear networks.
The SLT story about moving through the parameter space and the Bayesian posterior undergoing phase transitions is exactly what’s happening in the deep linear networks. If you’re willing to buy that generalization from that corner of theory of deep learning to general behavior of neural networks, then I think you are in some sense already buying the SLT story to some degree, [the story] of how learning is structured by looking for increasingly complex solutions. All of those are big question marks from a theoretical point of view, I would say.
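[A small illustration of the stage-wise rank learning being described, added here as a sketch; the specific target and hyperparameters are made up for the example.]

```python
import numpy as np

# Gradient descent on a two-layer deep linear network ("matrix factorization")
# with small initialization. The singular values of the product W2 @ W1 are
# picked up roughly one at a time: the model moves through ranks, simpler first.
rng = np.random.default_rng(0)
d = 8
U = np.linalg.qr(rng.normal(size=(d, d)))[0]
target = U[:, :3] @ np.diag([3.0, 2.0, 1.0]) @ U[:, :3].T   # a rank-3 target map

W1 = 1e-3 * rng.normal(size=(d, d))   # small init is what produces stage-wise behavior
W2 = 1e-3 * rng.normal(size=(d, d))
lr = 0.005
for step in range(1601):
    err = W2 @ W1 - target            # gradient of 0.5 * ||W2 W1 - target||_F^2
    gW2, gW1 = err @ W1.T, W2.T @ err
    W2 -= lr * gW2
    W1 -= lr * gW1
    if step % 200 == 0:
        svals = np.linalg.svd(W2 @ W1, compute_uv=False)[:4]
        print(step, np.round(svals, 2))
```

[With the small initialization, the printed singular values of the product climb toward 3, then 2, then 1 in distinct stages rather than all at once.]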
Putting that aside, what does SLT say about generalization? Well, it says that the asymptotic behavior of the generalization error as a function of the number of samples at the very end of training, let’s say, or the very end of the Bayesian learning process, looks like the irreducible loss plus a term that looks like lambda/N, where lambda is the local learning coefficient. If you take that irreducible loss over the other side, the difference between generalization error and its minimum value behaves like 1/n, is proportional to 1/n, and the constant of proportionality is the local learning coefficient. That’s the deep role of this geometric invariant, this measure of complexity in the description of generalization error in the Bayesian setting.
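[In symbols, writing $n$ for the number of samples, $G_n$ for the Bayesian generalization error, $L_0$ for the irreducible loss, and $\lambda$ for the (local) learning coefficient, the asymptotic being described is]

$$
\mathbb{E}[G_n] \;=\; L_0 + \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right),
\qquad\text{so}\qquad
\mathbb{E}[G_n] - L_0 \;\approx\; \frac{\lambda}{n}.
$$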
Now, what that says in deep learning… as I said, taking that first part of that bridge between the two worlds for granted, it would like to say something like: the test error when you’re looking at a particular region of parameter space is governed by the local learning coefficient, except that the relation between N and training is unclear. The exact way in which it governs test error is a function of how that bridge gets resolved. I think, at a technical level, it’s difficult to say much precise at the moment. I don’t think it’s impossible. It’s just that very few people are working on this and it hasn’t been getting enough attention to say more concrete things.
At a conceptual level, it says that—and this maybe starts to get into more interesting future work you can do taking the SLT perspective—but this relationship between the local learning coefficient and how that is determined by loss landscape geometry and generalization behavior, this is a very interesting link which I think is quite fundamental and interesting.
I think your question is going in the direction of Joar Skalse’s LessWrong post. Is that right?
Filan: That’s what I was inspired by: just this question of, suppose we believe the story of, we’re gradually increasing complexity as measured by the local learning coefficient in this model class: well, what does that actually say in terms of objects that I cared about before I heard of singular learning theory? What’s that telling me in terms of things I care about, of the behavior of these things?
Murfet: It could tell you things like: suppose you know two solutions of your problem that are qualitatively different. You have a data-generating process and you can think about it in two different ways and, therefore, model it in two different ways. Potentially, if you could estimate the local learning coefficient or derive it or have some method of knowing that one is lower than the other, it could tell you things like one will be preferred by the Bayesian posterior.
Now, to the extent that that is related to what SGD finds, that might tell you that training is more likely to prefer some class of solutions to another class. Now, if those parameters are just very different, completely different solutions, somehow not nearby in parameter space, maybe it’s quite difficult to make the bridge between the way the Bayesian posterior would prefer one or the other and what training will do because, in that case, the relationship between training and these two parameters is this very global thing to do with the trajectory of training over large parts of the parameter space, and very difficult perhaps to translate into a Bayesian setting.
In cases where you have two relatively similar solutions, maybe you had a choice to make. So during the training process, you had one of two ways to take the next step and accommodate some additional feature of the true distribution, and those two different choices differed in some complexity fashion that could be measured by the local learning coefficient: one was more complex, but lowered the loss by so much, and the other one was simpler, but didn’t lower the loss quite as much. Then you could make qualitative predictions for what the Bayesian posterior would prefer to do, and then you could ask, “Are those predictions also what SGD does?” Either, theoretically, you could try and find arguments for why that is true, but it [also] gives you an empirical prediction you can go and test.
In this toy model of superposition work we did, SGD training does seem to do the thing that the Bayesian posterior wants to do. That’s very unclear in general, but it gives you pretty reasonable, grounded predictions that you might then go and test, which I think is not nothing. That would be, I think, the most grounded thing you’d do with the current state of things.
Filan: I guess it suggests a research program of trying to understand which kinds of solutions do have a lower learning coefficient, which kinds of solutions have higher learning coefficients, and just giving you a different handle on the problem of understanding what neural network training is going to produce. Does that seem fair?
Murfet: Yeah. I think, [for] a lot of these questions about the relation between the theory and practice, our perspective on them will shift once we get more empirical evidence. What I expect will happen is that these questions seem to loom rather large when we’ve got a lot of theory and not so much empirical evidence. If we go out and study many systems and we see local learning coefficients or restricted local learning coefficients doing various stage-wise things and they correspond very nicely to the structure that’s developing, as we can test independently with other metrics, then I think it will start to seem a little bit academic whether or not it’s provably the case that SGD training does the same thing as the Bayesian posterior just because this tool, which…
To be clear, the local learning coefficient, if you look at the definition, has a sensible interpretation in terms of what’s happening to the loss as you perturb certain weights, and you can tell a story about it, it doesn’t rely on the link between the Bayesian posterior and SGD training or something. To the degree that the empirical work succeeds, I think people will probably take this independent justification, so to speak, of the LLC as a quantity that is interesting, and think about it as a reflection of what’s happening to the internal structure of the model. Then, the mathematicians like myself will still be happy to go off and try and prove these things are justified, but I don’t see this as necessarily being a roadblock to using it quite extensively to study what’s happening during training.
Singular learning theory vs other deep learning theory
Filan: Fair enough. I’d like to ask some questions thinking about SLT as compared to other potential theoretical approaches one could have to deep learning. The first comparison I have is to neural tangent kernel-style approaches. The neural tangent kernel, for listeners who don’t know, is basically this observation that, in the limit of infinitely wide neural networks under a certain method of initializing networks, the parameters don’t vary very much during training and, because the parameters don’t vary very much, that means you can do this mathematical trick. It turns out that your learning is basically a type of kernel learning, which is essentially linear regression on a set of features. Luckily, it turns out to be an infinite set of features and you can do it…
I don’t know how I was going to finish that sentence, but it turns out to be kernel learning on this set of features, and you can figure out what those features are supposed to be based on what your model looks like, what kinds of nonlinearities you’re using. There’s some family of theory trying to understand: what does the neural tangent kernel of various types of models look like, how close are we to the neural tangent kernel?
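[An illustrative sketch, added here, of the empirical neural tangent kernel of a small network: each input gets the feature vector $\nabla_\theta f_\theta(x)$, and the kernel is the inner product of these feature vectors. In the infinite-width regime this kernel stays approximately fixed during training, which is what reduces learning to kernel regression. The architecture and sizes below are arbitrary.]

```python
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(3, 256), torch.nn.Tanh(), torch.nn.Linear(256, 1))

def ntk_features(x):
    # gradient of the network output with respect to all parameters, flattened
    net.zero_grad()
    net(x.unsqueeze(0)).sum().backward()
    return torch.cat([p.grad.flatten() for p in net.parameters()])

xs = torch.randn(5, 3)
feats = torch.stack([ntk_features(x) for x in xs])
K = feats @ feats.T   # 5x5 empirical NTK Gram matrix at initialization
print(K)
```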
And if you believe in the neural tangent kernel story, you can talk about: the reason that neural networks generalize is that the neural tangent kernel, it tends to learn certain kinds of features before other kinds of features, and maybe those kinds of features are simpler. It seems plausible that you could do some story about phase transitions, and it’s a mathematically rigorous story. So I’m wondering, how do you think the singular learning theory approach of understanding deep learning compares to the neural tangent kernel-style approach?
Murfet: Yeah, good question. I think I’m not enough of an expert on the NTK [neural tangent kernel] to give a very thorough comparison, but I’ll do my best. Let me say first the places in which I understand that the NTK says very deep and interesting things. This work on the mu parametrization seems very successful. At initialization, when this “taking the limit to infinite width” is quite justified because the weights really are independent, this seems like probably the principal success of deep learning theory, to the extent there are any successes: the study of that limit and how it allows you to choose hyperparameters for learning rates and other things. Again, I’m not an expert, but that’s my understanding of how it’s used, and that seems to be quite widely used in practice, as far as I know. So that’s been a great success of theory.
I don’t think I believe in statements outside of that initial phase of learning though. I think there, as far as I understand it, the claims to applicability of the NTK methods become hypotheses, unless you then perturb away from the Gaussian process limit. The deep parts of that literature seem to me to be accepting the position that in the infinite width limit, you get some Gaussian process that isn’t actually a good description of the training process away from initialization, but then you can perturb back in basically higher-order terms in the exponent of some distribution. You can put in higher-order terms and study systematically those terms to get back to finite width, attempt to perturb away from infinite width back to finite width and accommodate those contributions in some fashion. And you can do that with tools from random matrix theory and Gaussian processes.
And that looks a lot like what people do in Euclidean quantum field theory, and so people have been applying techniques from that world to do that. And I think they can say non-trivial things, but I think it is overselling it to say that that is a theory on the same level of mathematical rigor and depth as SLT. So I don’t think it says things about the Bayesian posterior and its asymptotics, in the way that SLT does, I think it’s aiming at rather different statements. And I think, at least in my judgment at the moment, it has a little bit of the flavor of saying qualitative things rather than quantitative things. Again, this is my outsider’s impression, and I could be wrong about what the state of things is there.
But I would say that one part of that story that I have looked at a little bit is the work that my colleague, Liam Hodgkinson has done here. They have some very interesting recent work on information criterion in over-parameterized models—I think the title is something like that. [It’s] partly inspired by Watanabe’s work, I think, looking at trying to take, not only NTK, but this general sort of approach, point of view to doing things like what the free energy formula in SLT does. And so I think that’s quite interesting. I have my differences of opinion with Liam about some aspects of that, but mathematics isn’t actually divided into camps that disagree with one another or something, right?
So if things are both true, then they meet somewhere. And I can easily imagine that… SLT is sort of made up of two pieces, one of which is using resolution of singularities to do Laplace integrals, oscillatory integrals, and the other is dealing with empirical processes that intervene in that when you try to put it in the context of statistics. And I don’t think these kinds of oscillatory integrals, these techniques, have been used systematically by the people doing NTK-like stuff or Euclidean field theory-like stuff, but I think that if you took those techniques and used them in the context of the random matrix theory that’s going on there, you’d probably find that the perturbations that they’re trying to do can be linked up with SLT somewhere. So I mean, I think it all probably fits together eventually, but right now they’re quite separated.
Filan: Fair enough. So a related question I have is: one observation from the little I know about the deep learning theory literature is that the variance of the distribution from which parameters are initialized matters. So one example of this is in deep linear models. If your initialization distribution of parameters has high enough variance, then it looks something like the NTK: you only have a small distance until the optimum. Whereas if all the parameters are really, really close to zero at initialization, you have this jumping between saddle points. And in deep networks at one initialization, you have this neural tangent kernel story, which crucially doesn’t really involve learning features; it has a fixed set of features and you need to decide which ones to use. If you vary the variance of the initialization, then you start doing feature learning, and that seems qualitatively different.
If I think about how I would translate that to a singular learning theory story… At least in general, when people talk about Bayesian stories of gradient descent, often people think of the prior as being the initialization distribution. And in the free energy formula of singular learning theory, the loss and the learning coefficient come in at the leading orders, while the prior only comes in at this order-one term that matters not very much, basically.
Murfet: Well, late in training… I mean, late in the process it doesn’t matter.
Filan: Yeah. So I guess my question is: is singular learning theory going to have something to say about these initialization distribution effects?
Murfet: I haven’t thought about it at all, so this is really answering the question tabula rasa. I would say that from the asymptotic point of view, I guess we tend not to care about the prior, so this isn’t a question that we tend to think about too much so far, so that’s why I haven’t thought about it. But if you look at our model in the toy model of superposition, where you can really at least try and estimate the order N term in the asymptotic, the log N term in the asymptotic, and then these lower order terms… And maybe I should say what this asymptotic is. If you take the Bayesian posterior probability that’s assigned to a region of parameter space and take the negative of its logarithm (that’s a monotone function, so you can basically think of it as telling you how probable a given region is according to the posterior), you can give an asymptotic expansion for that in terms of N.
So for a large N, it looks like N times some number, which is kind of the average loss in that region or something like that, plus the local learning coefficient times log N plus lower order terms. The lower order terms we don’t understand very well, but there’s definitely a constant order term contributed from the integral of the prior over that region. Now if you look at the toy model of superposition, that constant order term is not insignificant at the scale of N at which we’re running our experiments. So it does have an influence, and I could easily imagine that this accounts for the kind of phenomena you’re talking about in DLNs [deep linear networks]. So a mathematician friend of mine, Simon Lehalleur, who’s an algebraic geometer who’s become SLT-pilled, maybe, has been looking at a lot of geometric questions in SLT and was asking me about this at some point.
And I guess I would speculate that if you just incorporated a constant term from those differences in initialization, that would account for this kind of effect. Maybe later in the year, we’ll write a paper about DLNs. At the moment, we don’t have a complete understanding of the local learning coefficients away from the global minimum, the local learning coefficients of the level sets. I think we probably are close to understanding them, but there’s a bit of an obstacle to completely answering that question at the moment. But I think in principle, that would be incorporated via the constant order term.
Which would, to be clear, not change the behavior at very large N, but for some significant range of Ns, potentially including the ones you’re typically looking at in experiments, that constant order term could bias some regions against others in a way that explains the differences.
Filan: Yeah. And I guess there’s also a thing where the constant order term, in this case the expansion is: you’ve got this term times N, you’ve got this term times the logarithm of N, you’ve got this term times the logarithm of the logarithm of N, if I remember correctly?
Murfet: Yep.
Filan: And then you have these constant things. And the logarithm of the logarithm of N is very small, right, so it seems like kind of easy for the constant order term to be more important than that, and potentially as important as the logarithm of N?
Murfet: Yeah, although that log log N term is very tricky. So the multiplicity, Aoyagi’s proof… as I said, she understands deep linear networks, and in particular understands the multiplicity, which is the coefficient of this log log N term up to a −1. And this can get… if I remember correctly, as a function of the depth it has this kind of behavior and it becomes larger and larger [he mimes gradually increasing, ‘bouncing’ curves].
Filan: Like a bouncing behavior with larger bounces?
Murfet: Yeah, that’s right.
Filan: Interesting.
Murfet: Yeah, so that’s very wild and interesting. One of the things Simon is interested in is trying to understand [it] geometrically. Obviously Aoyagi’s proof is a geometric derivation of that quantity, but from a different perspective. Maybe Aoyagi has a very clear conceptual understanding of what this bouncing is about, but I don’t. So anyway, the log log N term remains a bit mysterious, but if you’re not varying the depth and you have a fixed depth, maybe it is indeed the case that the constant order terms could be playing a significant role.
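To collect the pieces of this exchange in one place: the expansion being discussed has roughly the shape of Watanabe’s free energy asymptotics (this is a sketch, with signs as in the usual statement rather than exactly as spoken here). For a region of parameter space around a parameter $w^*$,

$$-\log P_n(\text{region}) \;\approx\; n\,L_n(w^*) \;+\; \lambda \log n \;-\; (m-1)\log\log n \;+\; c \;+\; o(1),$$

where $L_n(w^*)$ is roughly the average loss attained in the region, $\lambda$ is the local learning coefficient, $m$ its multiplicity, and $c$ is the constant-order contribution that includes the integral of the prior over the region. The initialization discussion above amounts to saying that at the values of $n$ used in experiments, $c$ can compete with the $\log\log n$ term, and even with $\lambda \log n$, in deciding which regions the posterior prefers.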
Filan: Sure. Right. So I guess a final question I have before I get into the relationship between singular learning theory and existential risk from AI: I’m more familiar with work done applying singular learning theory to deep learning. Is there much work outside that, applying singular learning theory to all the things people do outside my department?
Murfet: Yes. I mean, that’s where the theory has been concentrated, I would say, so. I don’t want to give the impression that Watanabe didn’t think about neural networks; indeed, the class of models based on neural networks was one of the original motivations for him developing SLT. And he’s been talking about neural networks from the beginning, so early that the state of the art neural networks had tanh nonlinearities, so that’s how long Watanabe’s been talking about neural networks. Watanabe has been 20 years ahead of his time or something. But having said that, deeper neural networks with nonlinearities remain something that we don’t have a lot of theoretical knowledge about. There are some recent results giving upper bounds for various quantities, but in general, we don’t understand deeper neural networks in SLT.
The predominant theoretical work has been done for singular models that are not neural networks, various kinds of matrix factorization. There’s some interesting work by [Piotr] Zwiernik and collaborators looking at various kinds of graphical models, trees, deriving learning coefficients for probabilistic graphical models that have certain kinds of graphs. There’s papers on latent Dirichlet allocation, if that’s the correct expansion of the acronym LDA: many, many papers, dozens, I think. I wouldn’t be able to list all the relevant models here, but there’s quite a rich literature out there over the last several decades looking at other kinds of models.
How singular learning theory hit AI alignment
Filan: All right. So at this stage I’d like to move on to: my experience of singular learning theory is, I’m in this AI existential risk space. For a while, people are chugging along doing their own thing. Then at one Effective Altruism Global, I have this meeting with this guy called Jesse Hoogland who says, “Oh, I’m interested in this weird math theory.” And I tell him, “Oh yeah, that’s nice. Follow your dreams.” And then it seems like at some point in 2023, it’s all everyone’s talking about, singular learning theory, it’s the key to everything, we’re all going to do singular learning theory now, it’s going to be amazing. How did that happen? What’s the story whereby someone doing singular learning theory gets interested in AI alignment or the reverse?
Murfet: Yeah, I guess I can’t speak to the reverse so much, although I can try and channel Alexander [Gietelink Oldenziel] and Jesse [Hoogland] and Stan [van Wingerden] a little bit. I guess I can give a brief runthrough of my story. I cared about SLT before I cared about alignment, so maybe I’ll say briefly why I came to care about SLT. I’m an algebraic geometer by training, so I spent decades thinking about derived categories in algebraic geometry and some mathematical physics of string theory and its intersection with algebraic geometry, et cetera. And then I spent a number of years thinking about linear logic, which might seem unrelated to that, but has some geometric connections as well. And then because of some influence of friends and colleagues at UCLA where I was a postdoc, I paid attention to deep learning when it was taking off again in 2012, 2013, 2014. I’d always been a programmer and interested in computer science in various ways and sort of thought that was cool.
And then I saw AlphaGo happen, and then the original scaling laws paper from Hestness et al. And it’s when I saw those two, AlphaGo and the Hestness et al. paper, that I was like, “huh, well maybe this isn’t just some interesting engineering thing, but maybe there’s actually some deep scientific content here that I might think about seriously, rather than just spectating on an interesting development somewhere else in the intellectual world.” So I cast around for ways of trying to get my hands on, with the mathematical tools that I had, what was going on in deep learning.
And that’s when I opened up Watanabe’s book, “Algebraic Geometry and Statistical Learning Theory”, which seemed designed to nerd-snipe me, because it was telling me geometry is useful for doing statistics. And then when I first opened it, I thought, that can’t possibly be true, this is some kind of crazy theory. And then I closed the book and put it away and looked at other things, and then came back to it eventually. So that’s my story of getting into SLT, from the point of view of wanting to understand universal mathematical phenomena in large-scale learning machines, and that’s my primary intellectual interest in the story. So I’ve been chugging away at that a little bit.
When I first started looking at SLT, it was—apart from Shaowei Lin, who did his PhD in SLT in the States, I believe, with Bernd Sturmfels—mostly Watanabe, his students, and a few collaborators, mostly in Japan, a few people elsewhere, a very small community. So I was sitting here in Melbourne, chugging away reading this book and I had a few students, and then Alexander Oldenziel found me and asked me what this could say about alignment, if anything. And at the time, I found it very difficult to see that there was anything SLT could say about alignment, I guess, because as a mathematician, the parts of the alignment literature that I immediately found comprehensible were things like Vanessa Kosoy’s work or Scott Garrabrant’s work. These made sense to me, but they seemed quite far from statistical learning theory, at least the parts that I understood.
And so I think my answer originally to Alexander was, “no, I don’t think it is useful for alignment”. But I kept reading more about the alignment problem, while being already very familiar with capabilities progress and believing that there was something deep and universal going on that that capabilities progress was latching onto, that it was not a phenomenon contingent on having a sequence of very complex engineering ideas, but more like “throw simple scaling and other things at this problem and things will continue to improve”. So that combination of believing in the capabilities progress and more deeply understanding what I was reading in the alignment literature about the problem… the product of that was me taking the problem seriously enough to think that I could profit from thinking a little bit more extensively about my initial answer.
So I did that and outlined some of the ideas I had about how this kind of stage-wise learning, or the phases and phase transitions that the Bayesian learning process and SLT talk about, might be used, by analogy with developmental biology, to understand how structure develops in neural networks. So I had some preliminary ideas around that [in the] middle of 2023, and those ideas were developed further by Alexander [Oldenziel] and Jesse Hoogland and Stan van Wingerden and various of my students and others, and that’s where this developmental interpretability agenda came from. And I think that’s sort of around the time you ran into SLT, if I remember correctly.
Filan: Yeah. The time I ran into it is: so, I hear a few different people mention it, including, if people listen to the episode of this podcast with Quintin Pope, he brings it up and it sounds interesting. And some other people bring it up, and that sounds interesting. And then I hear that you guys are running some sort of summer school thing, a week where you can listen to lectures on singular learning theory. And I’m like, “oh, I could take a week off to listen to some lectures, it seems kind of interesting”. This is summer of 2023. These lectures are still up on YouTube, so you can hear some guy ask kind of basic questions—that’s me.
Murfet: Yeah. I guess it took me a while to appreciate some of the things that… I mean, I guess John Wentworth has also been posting in various places how he sees SLT relating to some of the aspects of the alignment problem that he cares about. Now I see more clearly why some of the very core problems in alignment, things like sharp left turns and so on, the way that people conceptualize them… how SLT, when you first hear about it, might map onto that in a way that makes you think it could potentially be interesting.
I think my initial take being negative was mostly to do with it just being such a big gap at that time, the middle of last year, between SLT being a very highly theoretical topic…. I mean, I should be clear. The WBIC, which is the widely applicable Bayesian information criterion, which is a piece of mathematics and statistics that Watanabe developed, has been very widely used in places where the BIC [is used]. This is not an esoteric, weird mathematical object. This is a tool that statisticians use in the real world, as they say. The WBIC has been used in that way as well. And so the work we’ve been doing, with the local learning coefficient and SGLD and so on, is by far not the only place where SLT has met applications. That’s not the case. I don’t want to give that impression.
But the way SLT felt to me at that time was: there’s just so many questions about whether the Bayesian learning process is related to SGD training and all these other things we were discussing. So I think it was quite a speculative proposal to study the development process using these techniques, middle of last year. I think we’ve been hard at work over the last year seeing if a lot of those things pan out, and they seem to. So I think it’s much less speculative now to imagine that SLT says useful things, at least about stage-wise development in neural networks. I think it says more than that about questions of generalization that are alignment-relevant, but I think it was appropriate a year ago to think that there was some road to walk before it was clear that this piece of mathematics was not a nerd-snipe.
Filan: Sure. So at some point, this guy, Alex Oldenziel, reaches out to you and says, “hey, how is singular learning theory relevant to alignment?” And instead of deleting that email, you spent some time thinking about it. Why?
Murfet: Well, I should insert a little anecdote here, which is I think I did ignore his first email, not because I read it and thought he was a lunatic, but just because I don’t always get to every email that’s sent to me. He persisted, to his credit.
Filan: Why did it feel interesting to you, or why did you end up pursuing the alignment angle?
Murfet: I had read some of this literature before in a sort of “curious but it’s not my department” kind of way. I quite extensively read Norbert Wiener’s work. I’m a big fan of Wiener, and he’s written extensively, in God & Golem and The Human Use of Human Beings and elsewhere, precisely about the control problem or alignment problem in much the same way as modern authors do. And so I guess I had thought about that and seen that as a pretty serious problem, but not pressing, because AI didn’t work. And then I suppose I came to believe that AI was going to work, in some sense, and held these two beliefs, but in different parts of my brain. And it was Alexander that sort of caused the cognitive dissonance, the resolution of which was me actually thinking more about this problem.
So that’s one aspect of it—just causing me to try and make my beliefs about things coherent. But I think that wouldn’t have been sufficient without a second ingredient, and the second ingredient was: to the degree you assign a probability to something like AGI happening in a relatively short period of time, it has to affect your motivational system for doing long-term fundamental work like mathematics.
So as a kind of personal comment, the reason I do mathematics is not based on some competitive spirit or trying to solve tricky problems or something like that. I am very much motivated as a mathematician by the image of some kind of collective effort of the human species to understand the world. And I’m not [Ed] Witten or [Maxim] Kontsevich or [Alexander] Grothendieck or somebody, but I’ll put my little brick in the wall. And if I don’t do it, then maybe it’ll be decades before somebody does this particular thing. So I’m moving that moment forward in time, and I feel like that’s a valid use of my energies and efforts, and I’ll teach other people and train students to do that kind of thing, and I felt that was a very worthwhile endeavor to spend my life professionally on.
But if you believe that there are going to be systems around in 10 years, 20 years, 30 years—it doesn’t really matter, right, because mathematics is such a long-term endeavor. If you believe that at some time, soon-ish, systems will be around that will do all that for $0.05 of electricity and in 20 seconds… If that is your motivation for doing mathematics, it has to change your sense of how worthwhile that is, because it involves many tradeoffs against other things you could do and other things you find important.
So I actually found it quite difficult to continue doing the work I was doing, the more I thought about this and the more I believed in things like scaling laws and the fact that these systems do seem to understand what they’re doing, and there’s interesting internal structures and something going on we don’t understand. So I’d already begun shifting to studying the universal phenomena involved in learning machines from a geometric perspective, and I picked up statistics and empirical processes and all that. I’d already started to find that more motivating than the kind of mathematics I was doing before. And so it wasn’t such a big jump from that to being motivated by alignment and seeing a pathway to making use of that comparative advantage in theory and mathematics and seeing how that might be applicable to make a contribution to that problem.
There are many details and many personal conversations with people that helped me to get to that point, and in particular, my former master’s student, Matt Farrugia-Roberts, who was probably the person in my orbit who cared about alignment the most, and who I talked to the most about it. So that’s what led me to where I am now. Most of my research work is now motivated by applications to alignment.
Payoffs of singular learning theory for AI alignment
Filan: Sure. My next question is: concretely, what do you think it would look like for singular learning theory to be useful in the project of analyzing or preventing existential risk from AI?
Murfet: The pathway to doing that that we’re currently working on is providing some sort of rigorously founded empirical tools for understanding how structure gets into neural networks. And that has similar payoffs to those that many things in interpretability might have, and also potentially some of the same drawbacks. So I can talk about that in more detail, but maybe it’s better to sketch out, at a very high level, the class of things that theories like SLT might say and which seem related to the core problems in alignment. Then we can talk about some detailed potential applications.
So I rather like the framing that Nate Soares gave in a blog post he wrote in 2022, I think. I don’t know if that’s the post that introduced the term “sharp left turn”, but it’s where I learned about it.
So let me give a framing of what Soares calls the core technical problem in alignment, and which I tend to agree seems like the core problem. I’ll say it in a way which I think captures what he’s saying but is my own language. If we look at the way that large-scale neural networks are developing, they become more and more competent with scale both in parameters and data, and it seems like there’s something kind of universal about that process. What exactly that is, we don’t quite know, but many models seem to learn quite similar representations, and there are consistencies across scale and across different runs of the training process that seem hard to explain if there isn’t something universal.
So then, what is in common between all these different training processes? Well, it’s the data. So I guess many people are coming to a belief that structure in the data, whatever that means, is quite strongly determinant of the structures that end up in trained networks, whatever you take that to mean, circuits or whatever you like.
So then from that point of view, what Soares says is… his terms are “capabilities generalize further than alignment”. And the way I would put that is: if your approach to alignment is engineering the data distribution—things like RLHF or safety fine-tuning and so on, [that] fundamentally look like training with modified data that tries to get the network to do the thing you want it to do; if we just take as a broad class of approaches “engineer the data distribution to try and arrange the resulting network to have properties you like” -
If that’s your approach, then you have to be rather concerned with which patterns in the data get written more deeply into the model, because if… And Soares’s example is arithmetic: if you look in the world, there are many patterns that are explained by arithmetic. I don’t think this is how current models learn arithmetic, but you could imagine future multimodal models just looking at many scenes in the world and learning to count and then learning rules of arithmetic, et cetera, et cetera.
So anyway, there are some patterns in the world that are very deep and fundamental and explain many different samples that you might see. And if this is a universal phenomenon, as I believe it is, that the data determines structure in the models, then patterns that are represented more deeply in the world will tend perhaps to get inscribed more deeply into the models. Now, that’s a theoretical question. So that’s one of the questions you might study from a theoretical lens. Is that actually the case?
But the story with DLNs [deep linear networks] and learning modes of the data distribution in order of their singular values and all that tends to suggest that this is on the right track. And I think SLT has something more general to say about that. I can come back to that later, but I buy this general perspective that in the data, there are patterns. Not all patterns are equal, some are more frequent than others, some are sort of deeper than others in the sense that they explain more. And capabilities—whatever that means, but reasoning and planning and the things that instrumental convergence wants to talk about models converging to—these kinds of things might be patterns that are very deeply represented.
Whereas the things you are inserting into the data distribution to get the models to do what you want, the kind of things that you’re doing with RLHF for example, might not be as primary as those other patterns, and therefore the way they get written into the model in the end might be more fragile. And then when there’s a large shift in the data distribution, say from training to deployment or however you want to think about that, how do you know which of those structures in your model, associated to which structures in the data distribution, are going to break and which ones will not? Which ones are sacrificed by the model in order to retain performance?
Well, maybe it’s the ones that are shallower rather than the ones that are deeper. And on that theory, capabilities generalize further than alignment. So I think that post is sometimes criticized for its emphasis on the evolutionary perspective, on the contrast between in-lifetime human behavior and what evolution is trying to get people to do and so on. But I think that’s missing the point to some degree. I think this general perspective of structure in the data determining structure in the models, not all structure being equal, and our alignment attempts, if they go through structuring the data, perhaps being out-competed by structures in the data that are deeper when it comes to what happens when data distributions shift—I think this is a very sensible, very grounded, quite deep perspective on this problem, which as a mathematician makes a lot of sense to me.
So I think this is a very clear identification of a fundamental problem in Bayesian statistics even absent a concern about alignment, but it does seem to me to be quite a serious problem if you’re attempting to do alignment by engineering the data distribution. So I think my mainline interest is in approaching that problem and, well, we can talk about how you might do that. Obviously it’s a difficult and deep problem empirically and theoretically, and so we’re sort of building up to that in various ways, but I think that is the core problem that needs to be solved.
Filan: Sure. I guess if you put it like that, it’s not obvious to me what it would look like for singular learning theory to address this, right? Maybe it suggests something about understanding patterns in data and which ones are more fundamental or not, but I don’t know, that’s a very rough guess.
Murfet: I can lay out a story of how that might look. Obviously, this is a motivating story, but not one that has a lot of support right now. I can say the ingredients that lead into me thinking that that story has some content to it.
So we’ve been studying for the last year how the training process looks in models of various sizes and what SLT says about that, and part of the reason for doing that is because we think… I mean, other people have independent reasons for thinking this, but from an SLT perspective, we think that the structure of the training process or learning process reflects the structure of the data, what things are in it, what’s important, what’s not. So the idea is that the structure of the data is somehow revealed in the structure of the learning process, and that also informs the internal structures in the model that emerge, which then affect later structure and are present in the final model.
So that starts to give you some insight into, first, the mechanism by which structures in the data become structures in the model. If you don’t have that link, you can’t really do much. So if you can understand how structure in the data becomes structures—say, circuits or whatever—in the final model, that’s already something.
Then if you also understand the relative hierarchy of importance… how would you measure that? There’s several things you’d want to do in order to get at this question. You’d want to be able to, first of all, know what the structure in the data is. Well, unfortunately, training networks is probably the best way to find out what the structure in the data is. But suppose you’ve trained a network which sort of is a reflection, holding a mirror up to the data, and you get a bunch of structure in that model. Well, then you’re just looking at a big list of circuits. How do you tell which kinds of structure are associated to deep things in the data, which are very robust and will survive under large-scale perturbations, and [which are] very fragile structures that are somewhat less likely to survive perturbations in the data distribution if you had to keep training or expose the network to further learning?
Well, those are questions. Then there’s a question of stability of structure and how that relates to things you can measure, but these are fundamentally geometric questions from our point of view. So I think it actually is in scope for SLT to… Not right now, but there are directions of development of the theory of SLT that augment the invariants like the local learning coefficient and the singular fluctuation with other invariants you could attempt to estimate from data, which you could associate to these structures as you watch them emerging and which measures, for example, how robust they are to certain kinds of perturbations in the data distribution, so that you get some idea of not only what structure is in the model, but what is deep and what is shallow.
And how that pays off for alignment exactly, I guess it’s hard to say right now, but this seems like the kind of understanding you would need to have if you were to deal with this problem of generalization of capabilities outpacing alignment. If you were to have empirical and theoretical tools for talking about this sensibly, you’d at least have to do those things, it seems to me. So that’s how I would see concretely…
I mean, we have ideas for how to do all those things, but it’s still very early. The part that we sort of understand better is the correspondence between structure in the data and development, and the stages, and how those stages do have some geometric content. That’s what the changes in the local learning coefficient tell us. So all of that points in some direction that makes me think that the story I was just telling has some content to it, but that is the optimistic story, the one we’re working towards, of how SLT might eventually be applied to solve, or be part of the solution to, [the alignment] problem.
Filan: Sure. So I guess if I think about what this looks like concretely, one version of it is this developmental interpretability-style approach of understanding: are there phase transitions in models? At what points do models really start learning a thing versus a different thing? And then I also see some work trying to think about what I would think of as inductive biases. So in particular, there’s this LessWrong post. Is that too undignified? I don’t know if you posted it elsewhere, but there’s this thing you posted about-
Murfet: Not undignified. Yes, it was a LessWrong post.
Filan: Something about, you call it “short versus simple”. Thinking about a singular learning theory perspective on learning codes of Turing machines that are generating data and saying something beyond just the number of symbols in the code. Perhaps you want to explain that a little bit more for the audience?
Murfet: Sure. There’s been an interesting thread within the alignment literature, I think, if I’m correct, going back to Christiano writing about ghosts in the Solomonoff prior or something. And then Evan Hubinger wrote quite a bit about this, and others, which is motivated by the observation that if you’re producing very capable systems by a dynamical process of training, and you want to prove things about the resulting process—or maybe that’s too ambitious, but at least understand something about the resulting process and its endpoint—then you might like to know what kind of things that process typically produces, which is what “inductive biases” means.
And neural networks are not Turing machines, but we have some understanding of certain kinds of distributions over Turing machine codes. And there’s a kind of Occam’s razor principle there, which is spiritually related to the free energy formula that we were discussing earlier, although not directly analogous without making some additional choices.
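For readers who want the Occam’s razor principle being gestured at here in symbols, a minimal sketch (assuming the usual Solomonoff-style setup, which the conversation does not spell out): a prefix-free program $T$ of length $\ell(T)$ bits gets prior weight

$$P(T) \;\propto\; 2^{-\ell(T)},$$

so shorter programs carry exponentially more prior mass, and the induced distribution over outputs favors data that has short generating programs.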
But anyway, the story about inductive biases and its role in alignment has been going on for a number of years, and there’s been, I think, quite reasonably some discussion that’s critical of that in recent months on LessWrong. And my post sort of came out of reading that a little bit. So let me maybe just characterize briefly what the discussion is for some context.
We don’t understand the inductive bias of SGD training. We know some bits and pieces, but we really don’t understand systematically what that bias is. We don’t know that it’s a bias towards low Kolmogorov complexity functions: there are some papers pointing in that direction, but I don’t think they conclusively establish that. So I think we are just quite in the dark about what the inductive biases of SGD training are.
And I read these posts from, say, Christiano and Hubinger as saying, “Well, here we know about the inductive biases in some nearby conceptually similar thing. And if that knowledge could be used to reason about SGD training, then here would be the consequences. And these look potentially concerning from an alignment perspective.” And my model of both Christiano and Hubinger is that I think neither of them would claim those are ironclad arguments because there’s a big leap there, but it seems sufficient to motivate further research empirically, which is what, for example, Hubinger has been doing with the Sleeper Agents work.
So I think that’s very interesting, and I buy that, but with the big caveat that there is this gap there, that it isn’t on solid theoretical ground. And then you can criticize that work and say that it’s kind of spinning stories about how scary inductive biases are. And there were some posts from Nora Belrose and Quintin Pope critiquing the [argument, saying] if you take uncritically this story about inductive biases without really internalizing the fact that there is this big gap in there, then you might make overconfident claims about what the consequences of inductive biases may be.
So in some sense, I think both sides are correct. I think it’s reasonable to look at this and think, “Ah, this might tell us something, and so I’ll go away and do empirical work to see if that’s true.” I think it’s also accurate to think that people may have become a little bit overly spooked by our current understanding of inductive biases. So in that context, what I wanted to do with this post was to point out what our current state-of-the-art knowledge about Bayesian statistics, which is SLT, actually says, at least if by “inductive bias” one means “which parameters does the Bayesian posterior prefer?”…
This is not description length. It’s not even like description length, it’s just something else. And we don’t know what that is yet. But this step that Christiano and Hubinger were making from thinking about description length and inductive biases in SGD training as maybe being related, I’m pointing to a particular piece of that gap where I see that this is not justified.
Now, I think that maybe the concern that they derive from that connection may still be justified, but I think thinking about it roughly as description length is simply wrong. And then I gave a particular example in that post—not in neural networks, but in a Turing machine-oriented setting—of how the local learning coefficient behaves. In some cases, like the simple situation we were describing at the beginning of this podcast, where you have energy levels and the loss is locally a sum of squares, the local learning coefficient is just half the number of squares, and the number of squares is sort of the co-dimension. So that’s somewhat like description length.
So if you have a system where the LLC, the local learning coefficient, is basically half the number of variables you need to specify your thing, then that is description length, because you take your universal Turing machine, it’s got a code tape, and you need n squares to specify your code. Well, that’s roughly speaking n variables whose value you need to specify, and you need that value to stay close to the value you specified and not wander off in order to execute the correct program.
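In the near-regular case the connection being described can be written down directly; this is a sketch rather than the general statement. If, around a parameter $w^*$, the population loss locally looks like a sum of squares in $k$ of the coordinates,

$$K(w) \;\approx\; (w_1 - w_1^*)^2 + \cdots + (w_k - w_k^*)^2,$$

then the local learning coefficient is $\lambda(w^*) = k/2$, so $2\lambda$ counts the directions that have to be pinned down: roughly the number of code squares a universal Turing machine needs specified, which is why $\lambda$ behaves like a description length in this setting. The point of the post is that once stochasticity is introduced, $\lambda$ stops being this simple count.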
So there is quite a legitimate rigorous connection between description length and the local learning coefficient in the case where you’re dealing with models that have this near-regularity behavior that the loss function is just locally sums of squares. But it’s typical, as soon as you perturb this kind of universal Turing machine perspective and introduce some stochasticity, that the local learning coefficient becomes immediately more exotic and includes, for example, a bias towards error correction, which I’d present in the following way.
If you give someone some instructions, it’s no good those instructions being short if they’re so fragile that they can’t execute them reliably. So there’s actually some advantage to trading off succinctness against robustness to errors in execution, where you don’t have to get everything perfect and you’ll still more or less get what you want. And there’s some precise mathematical statement of that in that post.
That’s in the setting of Turing machines, so it’s provably the case that there will be some preference for Turing machines that are insensitive to certain kinds of errors if they’re executed in some slightly exotic way… The setting really is not meant to be thought of as directly analogous to what’s happening in neural networks. But I think there’s a high-level conceptual insight, which I sort of noticed after… I thought of those ideas along with my student, Will Troiani, at a meeting we had in Wytham that was organized by Alexander [Oldenziel] and Stan [van Wingerden] and Jesse [Hoogland].
There were some linear logic people there, and I was talking with them about this, and I had this idea with Will about error correction. And then later I twigged that there is a phenomenon in neural networks, these backup heads, where it does seem that neural networks may actually have a bias towards reliably computing important things by making sure that if some weight is perturbed in such a way that it takes out a certain head, that another head will compensate. So I’m speculating now, but when I see that sort of phenomenon, that makes sense to me, as a general principle of Bayesian statistics, that short is not necessarily better, degenerate is better, and degenerate can be both short but also redundant.
Filan: Right. So I guess to me this points to a qualitatively different way that singular learning theory could be useful, where one way is understanding developmental stages and how structure gets learned over time with data, and there’s this other approach which is better understanding what kinds of solutions Bayesian inference is going to prefer in these sorts of messy systems. And maybe that helps inform arguments that people tend to have about what sorts of nasty solutions should we expect to get. Does that seem fair to you?
Murfet: Yeah, I think so. I guess this observation about the inductive biases has sort of been on the side or something because we’ve been busy with other things. One of the things that my former student, Matt Farrugia-Roberts, who I mentioned earlier, and potentially others—I don’t know if Garrett Baker is interested in this, but he and Matt are working on an RL project right now that maybe eventually develops in this direction…
You could imagine that in a system that is doing reinforcement learning, that potentially some of these inductive biases—if they exist in neural networks, and that’s still speculation, but if this observation I’m making about this other setting with Turing machines, if this inductive bias towards error correction or robustness is universal, then you could imagine that this is actually a pretty significant factor in things like RL agents choosing certain kinds of solutions over others because they’re generally more robust to perturbations in their weights—things like making your environment safe for you to make mistakes. That’s speculation, but I do think that I agree that this is an independent direction in which potentially you can derive high-level principles from some of these mathematical ideas that would be useful.
Does singular learning theory advance AI capabilities?
Filan: Fair enough. So another question I have about this interplay between singular learning theory and AI alignment, AI existential risk is: a lot of people in the field use this kind of simplified model where there are some people working on making AI more generally capable and therefore more able to cause doom. And there are other people who are working on making sure AI doesn’t cause doom. And when you’re evaluating some piece of research, you’ve got to ask, to what extent does it advance capabilities versus alignment? And if it advances capabilities much more than alignment, then maybe you think it’s bad or you’re not very excited about it.
So with singular learning theory, one might make the critique that, well, if we have this better theory of deep learning, it seems like this is just going to generally be useful, and maybe it’s about as useful for causing doom as for preventing doom, or maybe it’s more useful for causing doom than for preventing doom, and therefore people on the anti-doom side should just steer clear of it. I’m wondering what you think about that kind of argument.
Murfet: Yeah, it’s a good question. I think it’s a very difficult question to think about properly. I have talked with many people about it. Not only on my own, but along with Alexander and Jesse and Stan and the other folks at Timaeus I’ve talked about this quite a bit. I talked with Lucius Bushnaq about it and some of the junior MIRI folks. So I’ve attempted to think about this pretty carefully, but I still remain very uncertain as to how to compute on these trade-offs, partly because especially this kind of research…
I mean, [in] empirical research, I suppose, you partly get out about as much as you put in or something. You have a certain number of experiments, you get a certain number of bits of insight. But theory sometimes doesn’t work like that. You crack something, and then lots and lots of things become visible. There’s a non-linear relationship between the piece of theory and the number of experiments it kind of explains. So my answer to this question could look extremely foolish just six months from now if a certain direction opens up, and then just very clearly the trade-off is not what I thought it was.
I guess one response to this question would be that we have prioritized thinking about directions within the theory that we think have a good trade-off in this respect. And for the things we’re currently thinking about, I just don’t see the ratio of contribution to alignment versus contribution to capabilities as being too small to justify doing it. So we are thinking about it and taking it seriously, but I don’t actually have a very systematic way of dealing with this question, I would say, even at this point. But I think that applies to many things you might do on a technical front.
So I guess my model is something like… And here I think Alexander and I differ a little, so maybe I’ll introduce Alexander’s position just to provide context. So I think if you have a position that capabilities progress will get stuck somewhere—for example, perhaps it will get stuck… I mean, maybe the main way in which people imagine it might get stuck is that there’s some fundamental gap between the kind of reasoning that can be easily represented in current models and the kind of reasoning that we do, and that you need some genuine insight into something involved—architecture or training processes or data, whatever—to get you all the way to AGI. And there’s some threshold there, and that’s between us and the doom. If there is such a threshold, then conceivably, you get unstuck by having better theory of how universal learning machines work and the relationship between data and structure, and then you can reverse engineer that to design better architectures. So I guess that’s pretty obviously the mainline way in which SLT could have a negative impact. If, on the other hand, you think that basically not too much more is required, nothing deep, then it’s sort of like, capabilities are going to get there anyway, and the marginal negative contribution from doing more theoretical research seems not that important.
So I think that seems to me the major divide. I think in the latter world where you sort of see systems more or less getting to dangerous levels of capability without much deeper insight, then I think that SLT research, I’m not that concerned about it. I think just broadly, one should still be careful and maybe not prioritize certain avenues of investigation that seem disproportionately potentially likely to contribute to capabilities. But on the whole, I think it doesn’t feel that risky to me. In the former case where there really is going to be a threshold that needs to be cracked with more theoretical progress, then it’s more mixed.
I guess I would like to err on the side of… Well, my model is something like it would be extremely embarrassing to get to the point of facing doom and then be handed the solution sheet, which showed that actually it wasn’t that difficult to avert. You just needed some reasonably small number of people to think hard about something for a few years. That seems pretty pathetic and we don’t know that we’re not in that situation. I mean, as Soares was saying in this post, he also, at least at that time, thought it wasn’t like alignment was impossible, but rather just a very difficult problem you need a lot of people thinking hard about for some period of time to solve, and it seems to me we should try. And absent a very strong argument for why it’s really dangerous to try, I think we should go ahead and try. But I think if we do hit a plateau and it does seem like theoretical progress is likely to critically contribute to unlocking that, I think we would have to reevaluate that trade-off.
Filan: Yeah. I wonder: it seems like you care both about whether there’s some sort of theoretical blocker on the capabilities side and also whether there’s some theoretical blocker on the alignment side, right?
Murfet: Yeah.
Filan: If there’s one on the alignment side but not on the capabilities side, then you’re really interested in theory. If there’s one on the capability side but not on the alignment side, then you want to erase knowledge of linear algebra from the world or something. Not really. And then if there’s both or neither, then you’ve got to think harder about relative rates. I guess that would be my guess?
Murfet: Yeah, I think that’s a nice way of putting it. I think the evidence so far is that the capabilities progress requires essentially no theory, whereas alignment progress seems to, so far, not have benefited tremendously from empirical work. I mean, I guess it’s fair to say that the big labs are pushing hard on that and believe in that, and I don’t know that they’re wrong about that. But my suspicion is that these are two different kinds of problems, and I do see this as actually a bit of a groupthink error in my view, in the more prosaic alignment strategy, which is: I think a lot of people in computer science and related fields think, maybe not consciously, but unconsciously feel like deep learning has succeeded because humans are clever and we’ve made the things work or something.
I think many clever people have been involved, but I don’t think it worked because people were clever. I think it worked because it was, in some sense, easy. I think that large scale learning machines want to work and if you just do some relatively sensible things… Not to undersell the contributions of all the people in deep learning, and I have a lot of respect for them, but compared to… I mean, I’ve worked in deep areas of mathematics and also in collaboration with physicists, the depth of the theory and understanding required to unlock certain advances in those fields, we’re not talking about that level of complexity and depth and difficulty when we’re talking about progress in deep learning.
Filan: I don’t know, I have this impression that the view that machines just want to learn, and that you just have to figure out some way of getting gradients to flow, is similar to the Bitter Lesson essay. To me, this perspective is… I feel like I see it in computer scientists, in deep learning people.
Murfet: Mm-hmm. Yeah. But I think that the confidence derived from having made that work seems like it may lead to a kind of underestimation of the difficulty of the alignment problem. If you think about, “Look, we really cracked deep learning as a capabilities problem and surely alignment is quite similar to that. And therefore because we’re very clever and have lots of resources and we really nailed this problem, therefore we will make a lot of progress on that problem.” That may be true, but it doesn’t seem like it’s an inference that you can make, to me. So I guess I do incline towards thinking that alignment is actually a different kind of problem, potentially, to making the thing work in the first place.
And this is quite similar to the view that I was attributing to Soares earlier, and I think there are good reasons, fundamental reasons from the view of statistics or whatever, to think that that might be the case. I think it’s not just a guess. I do believe that they are different kinds of problems, and therefore that has a bearing on the relative importance of… I do think alignment may be theoretically blocked, because it is a kind of problem that you may need theoretical progress for. Now, what does that mean? If we look at the empirical approaches to alignment that are happening in the big labs, and they seem to really be making significant contributions to the core problems of alignment, and at the same time capabilities sort of seem blocked, then I guess that would necessarily mean that I would move away from my view on the relative value of theoretical progress, because it might not be necessary for alignment, but might unblock capabilities progress or something.
Filan: Yeah. For what it’s worth, I think, at least for many people, I get the impression that the “optimism about prosaic alignment” thing maybe comes more from this idea that somehow the key to alignment is in the data and we’ve just got to figure out a way to tap into it, rather than “we’re all very smart and we can solve hard problems, and alignment’s just as hard as making capabilities work.” This is my interpretation of what people like Nora Belrose, Quintin Pope, Matthew Barnett think. They’re welcome to correct me, I might be misrepresenting them. I guess there’s also a point of view of people like Yann LeCun who think that we’re not going to have things that are very agentic, so we don’t need to worry about it. Maybe that is kind of a different perspective.
Open problems in singular learning theory for AI alignment
Filan: So changing topics a bit: suppose someone has listened to this podcast and they’re interested in this research program of developing singular learning theory, making it useful for AI alignment things: what are the open problems or the open research directions that they could potentially tap into?
Murfet: I’ll name a few, but there is a list on the DevInterp webpage. If you go to DevInterp, there’s an “open problems” page and there’s a Discord there where this question gets asked fairly frequently and you’ll find some replies.
Maybe there are several different categories of things which are more or less suited to people with different kinds of backgrounds. I think there already are, and will be an increasing number of, people coming from pure mathematics or the rather theoretical ends of physics who ask this question. To them, I have different answers than to people coming from ML or computer science, so maybe I’ll start with the more concrete end and then move into the more abstract end.
So on the concrete front, the current central tool in developmental interpretability is local learning coefficient estimation. I mentioned that this work that Zach [Furman] and Edmund [Lau] did gives us some confidence in those estimates for deep linear networks. But there is a lot of expertise out there in approximate Bayesian sampling, from people in probabilistic programming to Bayesian statistics in general. And I think a lot more could be done to understand the question of why SGLD is working to the extent it works. There was a recent deep learning theory conference in Lorne, organized by my colleague, Susan [Wei], and Peter Bartlett at DeepMind, and I posed this as an open problem there. I think it’s a good problem. So the original paper that introduced SGLD has a kind of proof that it should be a good sampler, but this proof… Well, I wouldn’t say it’s actually a proof of what you informally mean when you say SGLD works. So I would say it’s actually a mystery why SGLD is accurately sampling the LLC, even in deep linear networks.
Understanding that would give us some clue as to how to improve it or understand what it’s doing more generally. And this kind of scalable approximate Bayesian sampling will be fundamental to many other things we’ll do in the future with SLT. So if we want to understand more about the learned structure in neural networks, how the local geometry relates to this structure of circuits, et cetera, et cetera, all of that will at the bottom rely on better and better understanding of these approximate sampling techniques. So I would say there’s a large class of important fundamental questions to do with that.
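To make concrete what local learning coefficient estimation with SGLD involves, here is a minimal sketch in PyTorch of the kind of estimator used in this line of work. The function name, hyperparameters, and the exact form of the localization term are illustrative assumptions rather than any particular library’s API:

```python
import torch

def estimate_llc(avg_loss, w_star, n, num_steps=2000, eps=1e-4, gamma=100.0):
    """Sketch of SGLD-based local learning coefficient (LLC) estimation.

    avg_loss(w): average loss L_n(w) over the dataset (or a minibatch
        estimate of it) for a flat parameter tensor w.
    w_star: trained parameter around which the chain is localized.
    n: number of training samples.
    """
    beta = 1.0 / torch.log(torch.tensor(float(n)))  # inverse temperature 1/log n
    w = w_star.detach().clone().requires_grad_(True)
    running_loss = 0.0
    for _ in range(num_steps):
        loss = avg_loss(w)
        (grad,) = torch.autograd.grad(loss, w)
        with torch.no_grad():
            # Langevin step: scaled loss gradient, a restoring force pulling
            # the chain back toward w_star, and Gaussian noise of variance eps.
            w -= (eps / 2) * (n * beta * grad + gamma * (w - w_star))
            w += eps ** 0.5 * torch.randn_like(w)
        running_loss += loss.item()
    expected_loss = running_loss / num_steps
    # WBIC-style estimator: lambda_hat = n * beta * (E_posterior[L_n] - L_n(w*)).
    with torch.no_grad():
        return float(n * beta * (expected_loss - avg_loss(w_star)))
```

The open problem being pointed at is essentially why chains of this kind, run for modest numbers of steps with minibatch gradients, track the true local learning coefficient as well as they empirically seem to.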
A second class of questions, more empirically, is studying stagewise development in more systems, taking the kind of toolkit that we’ve now developed and applied to deep linear networks, to the toy model of superposition and small transformers, just running that on different systems. We had some MATS scholars, Cindy Wu and Garrett Baker and Xinyu Qian looking at this recently, and there’s a lot more in that direction one can do. I think those are sort of the main [categories]. Beyond that, maybe I’ll defer to the list of open problems on the webpage and talk about some more intermediate questions.
So there’s a lot more people at the moment with ML backgrounds interested in developmental interpretability than there are with the kind of mathematical backgrounds that would be required to do more translation work. At the moment, there are various other things in SLT, like the singular fluctuation, which we haven’t been using extensively yet, but which we’re starting to use. And I know there’s a PhD student of [Pratik] Chaudhari who’s investigating it and maybe a few others. But this is the other principal invariant besides the learning coefficient in SLT, which should also tell us something interesting about development and structure, but which hasn’t been extensively used yet. So that’s another interesting direction. Of course you can just take quantities and go and empirically use them, but then there’s questions… using the local learning coefficient, there’s some subtleties, like the role of the inverse temperature and so on.
And there are theoretical answers to the question, like, “Is it okay for me to do X?” When you’re doing local learning coefficient estimation, are you allowed to use a different inverse temperature? Well, it turns out you are, but the reason for that has some theoretical basis, and there is a smaller set of people who can look at the theory and know that it’s justified to do X. So if you have a bit more of a mathematical background, helping to lay out more foundations, knowing which things are sensible to do with these quantities, is important. Singular fluctuation is one.
Then ranging through to the more theoretical: at the moment, it’s basically Simon and myself and my PhD student, Zhongtian [Chen], who have a strong background in geometry and are working on SLT (Simon being Simon Lehalleur, who I mentioned earlier). Currently, a big problem with SLT is that it makes use of the resolution of singularities to do a lot of these integrals, but that resolution of singularities procedure is kind of hardcore or something. It’s a little bit hard to extract intuition from. So we do have an alternative perspective on the core geometry going on there based on something called jet schemes, which has a much more dynamical flavor, and Simon’s been working on that, and Zhongtian as well a little bit.
So I would say we’re maybe a few months away from having a pretty good starting point for anybody who has a geometric background to see ways to contribute to it. So the jet scheme story should feed into some of this discussion around stability of structures to data distribution shift that I was mentioning earlier. There’s lots of interesting theoretical open problems there to do with deformation of singularities that should have a bearing on basic questions about data distribution change in Bayesian statistics. So that’s a sketch of some of the open directions. But relative to the number of things to be done, there are very few people working on this. So if you want to work on this, show up in the Discord or DM me or email me and ask this question, and then I will ask what your background is and I will provide a more detailed answer.
What is the singular fluctuation?
Filan: Sure. At the risk of getting sucked down a bit of a rabbit hole: the singular fluctuation… I noticed that in this paper, Quantifying Degeneracy, it’s one of the two things you develop an estimator for. Maybe I should just read that paper more carefully, but I don’t understand what the point of this one is. The local learning coefficient we’re supposed to care about because it shows up in the free energy expansion, and that’s all great. What is the singular fluctuation? Why should I care about it?
Murfet: Okay, I’ll give two answers. The relation between them is in the mathematics and maybe not so clear. The first answer, which I think is the answer Watanabe, or rather the gray book, would give, is this: we were talking earlier about the theoretical generalization error, the KL divergence from the truth to the predictive distribution, which is a theoretical object you’ll never know exactly. So you’re interested in the gap between that and something you can actually estimate, which Watanabe calls the training error. I think one should not conflate that with some other meaning of training error that you might have in mind; it’s some form of generalization error which can be estimated from samples. If you can understand that gap, then obviously you can understand the theoretical object. And that gap is described by a theorem in terms of the learning coefficient and the singular fluctuation.
So one way of thinking about it is that the singular fluctuation controls the gap between these theoretical and empirical quantities. That is its theoretical significance. It’s much less understood: Watanabe flags in a few different places that this is something he would be particularly interested in people studying. For example, we don’t know bounds on it in the way that we might know bounds on the local learning coefficient. You can estimate it from samples in a similar way, but we don’t have any results saying that estimates based on SGLD are accurate, because those would depend on knowing theoretical values, which are much less known in general than learning coefficient values.
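[For orientation, a loose statement of the kind of relationship being described, not a verbatim quotation of the theorem; it is stated at inverse temperature 1 and only to leading order in 1/n.]

```latex
% Roughly, for the Bayes generalization error G_n and Bayes training error T_n:
\mathbb{E}[G_n] \approx L(w_0) + \frac{\lambda}{n}, \qquad
\mathbb{E}[T_n] \approx L(w_0) + \frac{\lambda - 2\nu}{n},
\qquad\text{so}\qquad
\mathbb{E}[G_n] - \mathbb{E}[T_n] \approx \frac{2\nu}{n},
% where \lambda is the learning coefficient and \nu is the singular fluctuation;
% for a regular model with d parameters, both reduce to d/2.
```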
The second answer to what the singular fluctuation is, is that it tells you something about the correlation between losses for various data samples. So if you take a fixed parameter and you look at some data set, it’s got N things in it, N samples. Then you can look at the loss for each sample, whose average is the empirical loss.
So for the i-th sample, you can take L_i(w), the loss of the parameter w on that sample. If you think of the parameter as being sampled from the Bayesian posterior locally, then L_i is a random variable depending on w. And then you can take the covariance matrix of those losses across all the different samples: roughly, E_w[L_i L_j] minus the product of the means, where the losses depend on the parameter, which is sampled from the posterior. And that covariance matrix is related to the singular fluctuation.
So it’s quite closely related to things like influence functions, or how sensitive the posterior is to including or leaving out certain samples, or leverage, or these kinds of notions from statistics. So it’s a kind of measure of how influential individual samples are; well, it’s that covariance matrix. We think this can be a tool for understanding more fine-grained structure than the local learning coefficient, or for studying correlation functions in that direction: not only two-point correlation functions like that, but more. So, at some conceptual level, this is going in the direction of extracting more fine-grained information from the posterior than you get with the local learning coefficient.
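[A toy sketch to make the second description concrete; this is not code from the paper. It assumes you already have draws from the local posterior and a way to evaluate per-sample losses; both are placeholders here.]

```python
# Given draws w ~ local posterior and per-sample losses L_i(w), look at the
# covariance of those losses (over the posterior) across data points.
import numpy as np

def per_sample_loss_covariance(posterior_samples, per_sample_loss):
    """posterior_samples: iterable of parameter draws w from the local posterior.
    per_sample_loss(w): returns an array of shape (n,) with L_i(w) for each datum i.
    Returns the n x n matrix Cov_w(L_i, L_j)."""
    L = np.stack([per_sample_loss(w) for w in posterior_samples])  # (num_draws, n)
    return np.cov(L, rowvar=False)  # covariance taken over posterior draws

# Watanabe's "functional variance" is closely related to the diagonal of this
# matrix: it sums the per-sample variances Var_w[L_i(w)].
```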
How geometry relates to information
Filan: Sure. Gotcha. So before we basically wrap up, is there any question that you wish I’d asked during this interview, but that I have not yet asked?
Murfet: Well, how about a question you did ask but I didn’t answer? We can circle back: you asked me at some point, I think, about how to think about the local learning coefficient for neural networks, and then I told some story about a simplified setting. So maybe I’ll just briefly come back to that. If you think about it, given an architecture and given data, the loss function represents constraints: it represents the constraint that certain parameters realize certain relationships between inputs and outputs. And the more constraints you impose, somehow the closer you get to some particular kind of underlying constraint. That’s what the population loss is telling you.
But if you think about what constraints are: constraints are equations, and there are several ways of combining equations. So if I tell you constraint F = 0 and constraint G = 0, then you can say, “this constraint OR that constraint”, and that is the equation FG = 0, because if FG is zero, then either F is zero or G is zero. And if you say the constraint F = 0 AND the constraint G = 0, then that’s kind of like taking the sum (not quite: to encode the “and” you have to take all linear combinations, which is one of the things geometry talks about; that would be taking the ideal generated by F and G). But basically, taking two constraints and taking their conjunction means something like taking their sum.
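[A small worked example for concreteness; the polynomials here are hypothetical and just illustrate the two constructions.]

```latex
% With F(x,y) = x and G(x,y) = y:
\underbrace{\{F = 0\} \cup \{G = 0\} = \{FG = 0\}}_{\text{``or'' via the product}}
\qquad\qquad
\underbrace{\{F = 0\} \cap \{G = 0\} = V(\langle F, G \rangle)}_{\text{``and'' via the ideal}}
% Here xy = 0 cuts out the union of the two coordinate axes, while the ideal
% (x, y) cuts out only the origin.
```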
So that gives you a vision of how you might take a very complex overall constraint, say the one exhibited by the population loss, in which all the structure in your data is implicit. It’s a very hard set of constraints to understand. And the geometry of the level sets of the population loss just is those constraints: that is the definition of what geometry is. It’s telling you all the different ways in which you can vary parameters while still obeying the constraints.
So it’s in some sense tautological that the geometry of the population loss is the study of the constraints implicit in the data. And I’ve just given you a mechanism for imagining how complex constraints could be expressed in terms of simpler, more atomic constraints: by expressing the population loss as, for example, a sum of positive things, so that minimizing it means minimizing all the separate things. That would be one decomposition, which looks like an “and”. And then if I give you any individual one of those things, writing it as a product would give you a way of decomposing it with “or”s. And this is what geometers do all day: we take complex constraints and study how they decompose into more atomic pieces in such a way that they can be reconstructed to express the overall original geometric constraint.
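[A schematic version of that decomposition; the particular splitting into pieces is hypothetical, just to show the shape of the idea.]

```latex
% If the population loss splits into non-negative pieces,
L(w) = \sum_k L_k(w), \qquad L_k(w) \ge 0,
% then its zero level set is the conjunction (``and'') of the pieces:
\{\, w : L(w) = 0 \,\} = \bigcap_k \{\, w : L_k(w) = 0 \,\},
% while writing an individual piece as a product, say L_k = (FG)^2,
% turns its zero set into a disjunction (``or''):
\{\, w : L_k(w) = 0 \,\} = \{F = 0\} \cup \{G = 0\}.
```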
So that is how geometry comes in: first, why structure in the data becomes structure in the geometry; and second, why the local learning coefficient, which is a measure of the complexity of that geometry, is conceptually quite natural to think of as a measure of the complexity of the representation of the solution you have in a given neighborhood of parameter space. At that point in parameter space, the loss function maybe doesn’t yet know about all the constraints, because it’s only managed to represent some part of the structure; but to the extent that it is representing the structure in the data, it is making the geometry complex in proportion to how much it has learned. Hence the learning coefficient, which measures that geometry, reflects how much has been learned about the data. So that’s a kind of story for why this connection to geometry is maybe not as esoteric as it seems.
Following Daniel Murfet’s work
Filan: All right. Well, to close up, if people are interested in following your research, how should they do that?
Murfet: They can find me on Twitter at @DanielMurfet. But I think the main way to get in touch with the research and the community is to go to DevInterp.com, as I mentioned earlier, and make yourself known on the Discord. And feel free to ask questions there; we’re all on there and we’ll answer questions.
Filan: Cool. Another thing I want to plug there is there’s this YouTube channel, I think it’s called Developmental Interpretability.
Murfet: That’s right.
Filan: And it has a bunch of good talks by you and other people about this line of research into singular learning theory as well as the lectures that I attended. Great. Well, it’s been really nice having you on. Thank you for coming.
Murfet: Yeah, thanks, Daniel.
Filan: This episode is edited by Jack Garrett, and Amber Dawn Ace helped with transcription. The opening and closing themes are also by Jack Garrett. Financial support for this episode was provided by the Long-Term Future Fund, along with patrons such as Alexey Malafeev. To read a transcript of this episode or to learn how to support the podcast yourself, you can visit axrp.net. Finally, if you have any feedback about this podcast, you can email me at feedback@axrp.net.