TLDR: We explain why you should care about model complexity, why the local learning coefficient is arguably the correct measure of model complexity, and how to estimate its value.
In particular, we review a new set of estimation techniques introduced by Lau et al. (2023). These techniques are foundational to the Developmental Interpretability research agenda and constitute the first generation of methods for detecting and understanding phase transitions, with potential applications for both interpretability and mechanistic anomaly detection. We expect this set of techniques to become a fixture in the alignment toolkit, and we’ve published a library and examples to help you get started.
Comparing models matters for safety. Given two models with the same behavior on a particular set of evals, we would like to be able to predict how they’ll behave on out-of-distribution data. Can we distinguish the deceptively aligned model from the non-deceptively aligned model? As a first pass, can we predict that two models will or will not behave similarly in the future?
Comparing via weights. Because models are singular (parameters do not correspond one-to-one with functions), it’s not possible to compare weights directly. Very different choices of weights can implement the same functions, and very similar choices of weights can implement qualitatively different functions.
Comparing via behavior. We also can’t compare models at the level of the loss because different functions are compatible with the same loss. The same is true even if we compare models sample-by-sample: we’d need an astronomical number of inputs to meaningfully constrain behavior (some of which we’d rather not expose our model to).
Comparing via invariants. Another option is to take a pointer from the mathematicians and physicists: in classifying mathematical objects like spaces, it helps to turn geometric problems into algebraic problems where we can compute invariants. If two spaces have different values of that invariant, then we know that they’re different. If the values are the same, the results are inconclusive.
The local learning coefficient. The quantity we’ll look at in this post, the (local) learning coefficient^λ(w∗), is an invariant that measures model complexity — a coarse-grained feature of the model that is useful for making high-level comparisons with other models. Though having identical learning coefficients doesn’t imply that two models are the same, having different learning coefficients tells us that two models are qualitatively different.
The learning coefficient is unique in that it is an output of the theory rather than an ad-hoc invention. Except for the loss itself, the learning coefficient is the most principled way to compare different models.
In particular, we’re aiming to understand phase transitions during learning — moments in which models go through sudden, unanticipated shifts. We want to be able to detect these transitions because we think most of the risk is contained in sudden, undesired changes to values or capabilities that occur during training (~sharp left turns) and sudden, undesired changes to behaviors that occur in-context (treacherous turns / mechanistic anomalies).
Local learning coefficient estimation is one of the first steps towards building tools for detecting and understanding these transitions.
Do phase transitions actually show up? So far, the places where theoretically predicted phase transitions are easiest to confirm are simplified settings like deep linear networks and toy models of superposition. For larger models, we expect phase transitions to be common but “hidden.” Among our immediate priorities are testing just how common these transitions are and whether we can detect hidden transitions.
Deep linear networks. Let’s consider the example of deep linear networks (DLNs). These are neural networks without nonlinear activations. In terms of expressivity, these models are identical to a simple affine transformation but in terms of dynamics they’re much more interesting. The observed “saddle-to-saddle” transitions involve sudden jumps in the loss that coincide with sudden increases in the learning coefficient.
From Basins to Model Complexity
In this section, we’ll motivate the learning coefficient by attempting to formalize a working definition of basin volume (through which we’ll see that what we really care about is basin volume scaling). We’ll see in what sense the learning coefficient measures model complexity and examine the resulting perspective on phase transitions.
Intractable: Basin Volume
Basin volume and model simplicity. Learning theorists have long been interested in the link between basin volume and model simplicity. We don’t really care what model we end up with as long as its loss is below some threshold ϵ. This is particularly true as that threshold approaches its minimum: we want to know how many “almost-as-good” solutions there are in the neighborhood of an obtained solution w∗.
Intuitively, broader basins (with more suitable models) are simpler. Because they take up more volume, they require fewer bits to locate. In accordance with Occam’s razor, we want the simplest possible solution for the problem at hand.[1]
To measure a basin. In particular, given an upper bound to the loss L(w)≤ϵ, we want to compute the following volume (=”area” in the case of two parameters):
V(ϵ)=∫L(w)≤ϵφ(w)dw
for some prior φ(w) over weights w. For our purposes, we can safely neglect the prior and treat it as constant. Consider the toy loss landscape below, where V(ϵ) is indicated by the gray projection.
Model complexity and comparison. Every model in a given basin has the same basin volume. This metric is a coarse-grained observable (much like volume in thermodynamics) that partitions weight space into subclasses of similar models. So seeing that two models have similar volumes increases the odds that they’re computationally similar. Conversely, having different volumes / complexities suggests that two models are qualitatively distinct.
Unfortunately, actually computing basin volume is intractable. This is because of the curse of dimensionality (the volume scales exponentially with dimensions) and because we can only ever approximate the “true” population loss L(w) through an empirical loss Ln(w) evaluated on a finite dataset Dn={xi}ni=1.
Wrong: Basin Flatness
From broadness to flatness. The usual workaround to this intractability is to use a Taylor expansion: as long as we’re close enough to the bottom of the basin, we can approximate the basin locally with a polynomial.
Most people stop at a second-order approximation, modeling the basin as a paraboloid whose curvatures are given by the Hessian at the minimum. It’s a simple matter of then computing the resulting surface volume.
Unfortunately, this approximation is wrong. Minimum flatness is a terrible proxy for basin volume. The fundamental problem is that neural networks are singular: their loss landscapes have degenerate critical points. In practice, Hessians of the loss generally have many zero eigenvalues (often a majority). When we attempt to approximate the volume with a paraboloid that has zero curvature, we naively predict unphysical infinite volumes.
The standard correction is to add a small constant to these zero eigenvalues, which stabilizes the Hessian. We get a very large volume, instead of an infinite volume. But this misses the point: when second order terms vanish, the qualitative equivalence between a function and its second-order Taylor approximation breaks down, and these Hessian-based approximations become invalid.
Instead, we need to go to higher order terms in the Taylor expansion. The largest contributions to the volume will come from the first non-vanishing terms in the expansion.
Flatness ≠ broadness. In the literature, flatness (in the sense of a small determinant of the Hessian) and broadness often get conflated. This is unfortunate: the two are not the same. Basin broadness is an interesting question. Minimum flatness is the wrong question.
Instead: Basin Dimension
Volume dynamics. We care about relative — not absolute — volumes. But even more importantly, we’re interested in dynamics: how the volume changes as learning progresses. As we gather more and more data, we’re able to become more discriminating and our threshold ϵ comes down. The real question, then, is: how does the volume scale as ϵ changes? This is more relevant even in the non-degenerate regular case.
Volume scaling > volume. Studying volume scaling solves several others problems with the volume. First, for sufficiently small ϵ, relative volumes become independent of ϵ, which allows us to eliminate the dependence on this arbitrary cut-off. Furthermore, volume scaling is invariant under reparametrizations and non-diverging even near degenerate critical points.
The learning coefficient as effective dimensionality. The volume scales as
V(ϵ)∼cϵλasϵ→0.
The scaling exponent λ is known as the learning coefficient (or “real log canonical threshold”). It plays the role of an effective dimensionality (up to a constant multiple).
Let’s illustrate this with some examples.
Regular models. First, consider thecase of regular (=non-singular) models, where the loss landscape has no degenerate critical points. In this case, the Hessian-based approximation holds, and the volume grows with exponent λ=d/2, where d is the number of dimensions (or parameter count).
Singular models. Consider the function L(w1,w2)=w41+w42. At the origin, the Hessian is equal to the zero matrix, and the learning coefficient is λ=1/2.
Compare this to the function L(w1,w2)=w21w42. Again the Hessian at the origin is the zero matrix, but the learning coefficient now takes a different value λ=1/4.
For singular models, the parameter count sets an upper bound 2λ≤d. Lower learning coefficient means the volume scales slower for for large ϵ>1 but faster for small ϵ<1. There’s exponentially more volume in the immediate vicinity of degenerate critical points than in the vicinity of non-degenerate critical points.
Against Hessians. The two preceding examples explicitly demonstrate the failures of the Hessian mentioned earlier. Their Hessians are identical at the origin, but their scaling behaviors are radically different. The Hessian is a poor judge of volume, and derived metrics like the rank of the Hessian are insufficient notions of model complexity.[2]
The learning coefficient as an inductive bias. The Occam’s razor intuition that simpler models (with lower complexity) generalize better is formalized by Watanabe’s free energy formula,
Fn→nLn(w∗)+λlogn+O(loglogn)asn→∞,
where n is the number of training samples, Ln is the loss (more precisely, the negative log likelihood) which depends on a dataset Dn of n samples, and w∗ is an optimal choice of weights.
A full explanation and derivation of this formula is beyond the scope of this post (see Liam’s excellent sequence), but the important bit is that we can construe the problem of Bayesian learning as a problem of trying to minimize this free energy, involving a tradeoff between accuracy (lower nLn(w∗)) and complexity (lower λlogn). Neural networks are intrinsically biased towards simpler solutions.
Intuition for phase transitions. Consider a toy landscape with two local minima at w1and w2 with heights L(w1)>L(w2). Both minima have an associated learning coefficient that determines how the volume scales in its neighborhood. Suppose λ(w1)<λ(w2). Solution 1 is simpler but with higher loss, and solution 2 is more accurate but also more complex.
Coming back to the volume for a moment, what happens as we gradually lower the threshold ϵ?
What we’re interested in is the fraction of the surface area (=the volume of weight space) that is taken up by different kinds of solutions. At first, this surface area, and thus the behavior of our models, is dominated by the more degenerate solution at w1.
As we sample more data (n→∞) the sea starts to recede. When the sea level approaches the level of one of these highly degenerate, low-accuracy solutions, that solution suddenly and very rapidly loses a massive amount of surface area precisely because it is so degenerate. Shallow puddles vanish all at once.
The fraction of surface area taken up by successive, more complex solutions jumps just as suddenly and rapidly. There’s a phase transition — a qualitative change in the kind and complexity of functions being learned.
The sea continues to sink. From where we’re sitting — above the water — we can’t see below the surface. We don’t know where or when the next transition will take place. We hold our breaths...
Estimating Learning Coefficients
We’ve seen what the learning coefficient is and how it measures the complexity of a model. But how can we go about practically estimating its value? In this section, we describe a set of techniques introduced by Lau et al. (2023).
Deriving the Estimator
From free energies to learning coefficients. The free energy formula we saw earlier,
Fn→nLn(w∗)+λlogn+O(loglogn)asn→∞,
offers a means of calculating the learning coefficient. We simply shuffle the terms around to obtain:
λ→Fn−nLn(w∗)logn+O(loglognlogn).
WBIC. Unfortunately, performing the integral required to evaluate Fn is intractable. As an alternative, we can consider the Widely Applicable Bayesian Information Criterion (WBIC), which generalizes the Bayesian Information Criterion (BIC) to singular models,
sampled at inverse temperature β∗=1/logn(Watanabe, 2013). The two highest order terms in the expansion of the WBIC match that of the asymptotic expansion of Fn, which means we can substitute this in for Fn to estimate the learning coefficient:
^λ=Eβ∗w[nLn(w)]−nLn(w∗)logn.
Expectations to empirical averages. To use this estimator, we need to first estimate the expectation value Eβ∗w[nLn(w)]. We can do this by sampling a set of weights Wβ∗={w1,…,wm|wi∼pβ∗(w|Dn)} and replacing the expectation value with an empirical average over the losses associated to those weights,
Eβ∗w[nLn(w)]≈1mm∑i=1nLn(wi).
Unfortunately, estimating Eβ∗w[nLn(w)] in practice runs into significant curse-of-dimensionality effects for systems at scale.
The Local Learning Coefficient
From global to local. The problem becomes substantially more tractable if, instead of attempting to estimate the global learning coefficient, we estimate a local learning coefficient instead. We restrict our attention to a subset of weights and study how the volume scales in just that region. The scaling is dominated by the most complex local singularities.
As we saw in the aside on phase transitions, this local quantity is also more interesting from the perspective of dynamics. When actually training models, we don’t care about the behaviors of all possible models, we care about the behavior of a specific model as it moves through the loss landscape.
The local learning coefficient. To estimate the local learning coefficient, Lau et al. (2023) replace the prior φ(w) with a localizing prior centered at the point ^w∗ whose learning coefficient we want to estimate:
φ(w)→φ(w−^w∗).
This ensures that samples drawn will remain “close” to the original point. Substituting this into the global ^λ estimator we saw earlier, we obtain the definition of the local learning coefficient,
^λ(^w∗)=Eβ∗w|^w∗[nLn(w)]−nLn(^w∗)logn,
known colloquially as “lambda-hat,” where Ew|^w∗ indicates the use of a localizing prior.
To estimate this expectation value, we compute an empirical average over the losses of a set of samples {wi}mi=1, now drawn from the localized posterior:
p(w|Dn,^w∗)∝p(Dn|w)φ(w−^w∗).
Given this empirical average, we plug it into the above formula for ^λ (subtracting n times the loss at ^w∗ and dividing by logn) to obtain an estimate for the local learning coefficient.
Implementing the Estimator
How to localize? The first practical question we run into when implementing the local learning coefficient estimator is how to localize. Lau et al. consider a Gaussian prior parametrized by a scale γ. This is not the only possible choice, but it is perhaps the most straightforward choice: a gentle quadratic restoring force.
How to sample? To draw samples from the posterior pβ∗(w|Dn,^w∗,γ), the natural thought is to use MCMC techniques. Explaining why these work is beyond the scope of this piece and is well covered elsewhere, but what’s important here is that standard MCMC methods run into the curse of dimensionality, which makes them unsuitable for larger models.
SGLD. To scale MCMC methods to higher dimensions, we need to incorporate information from gradients. This allows us to “preselect” low-loss, high-posterior regions of weight space. One of the simplest examples of such a technique is Stochastic Gradient Langevin Dynamics (SGLD), which is SGD with the addition of an explicit noise term η∼N(0,ϵ) (Welling & Teh 2011). This allows random movement along low loss dimensions, while quickly pushing any step to a higher loss area back down.
SGNHT and beyond. Adding momentum & friction would allow us to explore these ridges of the posterior more effectively. One such example is the Stochastic Gradient Nosé-Hoover Thermostat (SGNHT), which uses a thermostat for its friction term (Ding et al. 2014). Although clearly performing better in the toy potential above, it’s still unclear how well this approach generalizes to higher dimensions as these samplers can be harder to get working than SGLD.
Limitations
Engineering limitations. The above estimators do not yet live up to their full theoretical potential. To give an example, ^λ(w) depends on hyperparameters like the number of samples, the localizing strength γ, the SGLD learning rate ϵ, etc. The true coefficient λ(w) is independent of these hyperparameters.
We care less about whether ^λ is an unbiased estimator of λ and more about whether this estimator satisfies certain basic desiderata of model complexity. These include properties like ordinality: that λ(w1)≥λ(w2) implies ^λ(w1)≥^λ(w2). Improving this estimator and confirming that it satisfies these desiderata is currently bottlenecked more on engineering than on theory.
Theoretical limitations. At the same time, this estimator sometimes exceeds its theoretical underpinnings, yielding sensible results despite violating certain technical assumptions. This means there’s work to do for theorists in terms of finding assumptions to relax and studying, for example, how to extend the learning coefficient outside of non-minima. Currently, if not applied at a local minimum, the estimator can sometimes yield unphysical negative model complexities.[3]
There are limiting assumptions underlying the learning coefficient and additional limiting assumptions underlying the local learning coefficient. Both require further attention.
Proceed with caution. The current techniques for estimating ^λ are finicky to get working and should be applied with care. This is a research tool, not a plug-and-play observable printing press. Expect annoying hyperparameter tuning, (un)correlated estimates, weirdly high ^λ‘s, negative ^λ‘s, infinite ^λ’s, etc. This requires at least as much caution as MCMC in any other high-dimensional system: pay careful attention to diagnostics of chain health and vary hyperparameters to hunt for dependencies, systematic or otherwise.
What’s Next
Improving estimates. An immediate priority is to confirm that ^λ and λ match where it counts. Does it satisfy our desiderata? And can we calibrate ^λ in settings where we can analytically derive λ? There’s work to do on the theoretical front and on the computational front in terms of exploring more scalable and more accurate ways to sample from the local posterior beyond SGLD.
More observables. SLT also tells us how to measure other quantities of interest, such as the singular fluctuation. We’re also working on the second and third generations of techniques for studying transitions that extract vector and higher-order information about how models are changing.
Go forth and estimate! Although the current implementation is far from perfect, it should be good enough to start applying these techniques in practice. It’s as easy as pip install devinterp and from devinterp.slt import estimate_learning_coeff (or at least, it’s easy up until that point). We recommend checking out these example notebooks for inspiration.
If you want to learn more about the learning coefficient, see this post, this lecture, or this book. For updates on developmental interpretability, join the discord. Stay tuned for more updates soon!
Actually extending the minimum description length principle (‘choose the model that takes the fewest bits to specify’) to singular models is an open problem. The free energy formula (covered later) is not quite the same, though it fulfills the same purpose of formalizing Occam’s razor.
To give the Hessian some credit: its rank can give an us a lower bound on the model complexity. The Hessian still has its place. (EDIT: this originally said upper bound — that was wrong)
This occurs when the sampler strays beyond its intended confines and stumbles across models with much lower loss than those in the desired neighborhood.
You’re Measuring Model Complexity Wrong
TLDR: We explain why you should care about model complexity, why the local learning coefficient is arguably the correct measure of model complexity, and how to estimate its value.
In particular, we review a new set of estimation techniques introduced by Lau et al. (2023). These techniques are foundational to the Developmental Interpretability research agenda and constitute the first generation of methods for detecting and understanding phase transitions, with potential applications for both interpretability and mechanistic anomaly detection. We expect this set of techniques to become a fixture in the alignment toolkit, and we’ve published a library and examples to help you get started.
Why Model Complexity?
Model Comparison
Comparing models matters for safety. Given two models with the same behavior on a particular set of evals, we would like to be able to predict how they’ll behave on out-of-distribution data. Can we distinguish the deceptively aligned model from the non-deceptively aligned model? As a first pass, can we predict that two models will or will not behave similarly in the future?
Comparing via weights. Because models are singular (parameters do not correspond one-to-one with functions), it’s not possible to compare weights directly. Very different choices of weights can implement the same functions, and very similar choices of weights can implement qualitatively different functions.
Comparing via behavior. We also can’t compare models at the level of the loss because different functions are compatible with the same loss. The same is true even if we compare models sample-by-sample: we’d need an astronomical number of inputs to meaningfully constrain behavior (some of which we’d rather not expose our model to).
Comparing via invariants. Another option is to take a pointer from the mathematicians and physicists: in classifying mathematical objects like spaces, it helps to turn geometric problems into algebraic problems where we can compute invariants. If two spaces have different values of that invariant, then we know that they’re different. If the values are the same, the results are inconclusive.
The local learning coefficient. The quantity we’ll look at in this post, the (local) learning coefficient ^λ(w∗), is an invariant that measures model complexity — a coarse-grained feature of the model that is useful for making high-level comparisons with other models. Though having identical learning coefficients doesn’t imply that two models are the same, having different learning coefficients tells us that two models are qualitatively different.
The learning coefficient is unique in that it is an output of the theory rather than an ad-hoc invention. Except for the loss itself, the learning coefficient is the most principled way to compare different models.
Model Development
Model comparison is necessary to study model development. We’re interested in comparing models across time to understand how they develop. High-level observables like the local learning coefficient give us insight into this process without requiring a detailed understanding of the model’s internal mechanisms.
In particular, we’re aiming to understand phase transitions during learning — moments in which models go through sudden, unanticipated shifts. We want to be able to detect these transitions because we think most of the risk is contained in sudden, undesired changes to values or capabilities that occur during training (~sharp left turns) and sudden, undesired changes to behaviors that occur in-context (treacherous turns / mechanistic anomalies).
Local learning coefficient estimation is one of the first steps towards building tools for detecting and understanding these transitions.
Do phase transitions actually show up? So far, the places where theoretically predicted phase transitions are easiest to confirm are simplified settings like deep linear networks and toy models of superposition. For larger models, we expect phase transitions to be common but “hidden.” Among our immediate priorities are testing just how common these transitions are and whether we can detect hidden transitions.
Deep linear networks. Let’s consider the example of deep linear networks (DLNs). These are neural networks without nonlinear activations. In terms of expressivity, these models are identical to a simple affine transformation but in terms of dynamics they’re much more interesting. The observed “saddle-to-saddle” transitions involve sudden jumps in the loss that coincide with sudden increases in the learning coefficient.
From Basins to Model Complexity
In this section, we’ll motivate the learning coefficient by attempting to formalize a working definition of basin volume (through which we’ll see that what we really care about is basin volume scaling). We’ll see in what sense the learning coefficient measures model complexity and examine the resulting perspective on phase transitions.
Intractable: Basin Volume
Basin volume and model simplicity. Learning theorists have long been interested in the link between basin volume and model simplicity. We don’t really care what model we end up with as long as its loss is below some threshold ϵ. This is particularly true as that threshold approaches its minimum: we want to know how many “almost-as-good” solutions there are in the neighborhood of an obtained solution w∗.
Intuitively, broader basins (with more suitable models) are simpler. Because they take up more volume, they require fewer bits to locate. In accordance with Occam’s razor, we want the simplest possible solution for the problem at hand.[1]
To measure a basin. In particular, given an upper bound to the loss L(w)≤ϵ, we want to compute the following volume (=”area” in the case of two parameters):
V(ϵ)=∫L(w)≤ϵφ(w)dwfor some prior φ(w) over weights w. For our purposes, we can safely neglect the prior and treat it as constant. Consider the toy loss landscape below, where V(ϵ) is indicated by the gray projection.
Model complexity and comparison. Every model in a given basin has the same basin volume. This metric is a coarse-grained observable (much like volume in thermodynamics) that partitions weight space into subclasses of similar models. So seeing that two models have similar volumes increases the odds that they’re computationally similar. Conversely, having different volumes / complexities suggests that two models are qualitatively distinct.
Unfortunately, actually computing basin volume is intractable. This is because of the curse of dimensionality (the volume scales exponentially with dimensions) and because we can only ever approximate the “true” population loss L(w) through an empirical loss Ln(w) evaluated on a finite dataset Dn={xi}ni=1.
Wrong: Basin Flatness
From broadness to flatness. The usual workaround to this intractability is to use a Taylor expansion: as long as we’re close enough to the bottom of the basin, we can approximate the basin locally with a polynomial.
Most people stop at a second-order approximation, modeling the basin as a paraboloid whose curvatures are given by the Hessian at the minimum. It’s a simple matter of then computing the resulting surface volume.
Unfortunately, this approximation is wrong. Minimum flatness is a terrible proxy for basin volume. The fundamental problem is that neural networks are singular: their loss landscapes have degenerate critical points. In practice, Hessians of the loss generally have many zero eigenvalues (often a majority). When we attempt to approximate the volume with a paraboloid that has zero curvature, we naively predict unphysical infinite volumes.
The standard correction is to add a small constant to these zero eigenvalues, which stabilizes the Hessian. We get a very large volume, instead of an infinite volume. But this misses the point: when second order terms vanish, the qualitative equivalence between a function and its second-order Taylor approximation breaks down, and these Hessian-based approximations become invalid.
Instead, we need to go to higher order terms in the Taylor expansion. The largest contributions to the volume will come from the first non-vanishing terms in the expansion.
Flatness ≠ broadness. In the literature, flatness (in the sense of a small determinant of the Hessian) and broadness often get conflated. This is unfortunate: the two are not the same. Basin broadness is an interesting question. Minimum flatness is the wrong question.
Instead: Basin Dimension
Volume dynamics. We care about relative — not absolute — volumes. But even more importantly, we’re interested in dynamics: how the volume changes as learning progresses. As we gather more and more data, we’re able to become more discriminating and our threshold ϵ comes down. The real question, then, is: how does the volume scale as ϵ changes? This is more relevant even in the non-degenerate regular case.
Volume scaling > volume. Studying volume scaling solves several others problems with the volume. First, for sufficiently small ϵ, relative volumes become independent of ϵ, which allows us to eliminate the dependence on this arbitrary cut-off. Furthermore, volume scaling is invariant under reparametrizations and non-diverging even near degenerate critical points.
The learning coefficient as effective dimensionality. The volume scales as
V(ϵ)∼cϵλasϵ→0.The scaling exponent λ is known as the learning coefficient (or “real log canonical threshold”). It plays the role of an effective dimensionality (up to a constant multiple).
Let’s illustrate this with some examples.
Regular models. First, consider the case of regular (=non-singular) models, where the loss landscape has no degenerate critical points. In this case, the Hessian-based approximation holds, and the volume grows with exponent λ=d/2, where d is the number of dimensions (or parameter count).
Singular models. Consider the function L(w1,w2)=w41+w42. At the origin, the Hessian is equal to the zero matrix, and the learning coefficient is λ=1/2.
Compare this to the function L(w1,w2)=w21w42. Again the Hessian at the origin is the zero matrix, but the learning coefficient now takes a different value λ=1/4.
For singular models, the parameter count sets an upper bound 2λ≤d. Lower learning coefficient means the volume scales slower for for large ϵ>1 but faster for small ϵ<1. There’s exponentially more volume in the immediate vicinity of degenerate critical points than in the vicinity of non-degenerate critical points.
Against Hessians. The two preceding examples explicitly demonstrate the failures of the Hessian mentioned earlier. Their Hessians are identical at the origin, but their scaling behaviors are radically different. The Hessian is a poor judge of volume, and derived metrics like the rank of the Hessian are insufficient notions of model complexity.[2]
The learning coefficient as an inductive bias. The Occam’s razor intuition that simpler models (with lower complexity) generalize better is formalized by Watanabe’s free energy formula,
Fn→nLn(w∗)+λlogn+O(loglogn)asn→∞,where n is the number of training samples, Ln is the loss (more precisely, the negative log likelihood) which depends on a dataset Dn of n samples, and w∗ is an optimal choice of weights.
A full explanation and derivation of this formula is beyond the scope of this post (see Liam’s excellent sequence), but the important bit is that we can construe the problem of Bayesian learning as a problem of trying to minimize this free energy, involving a tradeoff between accuracy (lower nLn(w∗)) and complexity (lower λlogn). Neural networks are intrinsically biased towards simpler solutions.
Intuition for phase transitions. Consider a toy landscape with two local minima at w1and w2 with heights L(w1)>L(w2). Both minima have an associated learning coefficient that determines how the volume scales in its neighborhood. Suppose λ(w1)<λ(w2). Solution 1 is simpler but with higher loss, and solution 2 is more accurate but also more complex.
Coming back to the volume for a moment, what happens as we gradually lower the threshold ϵ?
What we’re interested in is the fraction of the surface area (=the volume of weight space) that is taken up by different kinds of solutions. At first, this surface area, and thus the behavior of our models, is dominated by the more degenerate solution at w1.
As we sample more data (n→∞) the sea starts to recede. When the sea level approaches the level of one of these highly degenerate, low-accuracy solutions, that solution suddenly and very rapidly loses a massive amount of surface area precisely because it is so degenerate. Shallow puddles vanish all at once.
The fraction of surface area taken up by successive, more complex solutions jumps just as suddenly and rapidly. There’s a phase transition — a qualitative change in the kind and complexity of functions being learned.
The sea continues to sink. From where we’re sitting — above the water — we can’t see below the surface. We don’t know where or when the next transition will take place. We hold our breaths...
Estimating Learning Coefficients
We’ve seen what the learning coefficient is and how it measures the complexity of a model. But how can we go about practically estimating its value? In this section, we describe a set of techniques introduced by Lau et al. (2023).
Deriving the Estimator
From free energies to learning coefficients. The free energy formula we saw earlier,
Fn→nLn(w∗)+λlogn+O(loglogn)asn→∞,offers a means of calculating the learning coefficient. We simply shuffle the terms around to obtain:
λ→Fn−nLn(w∗)logn+O(loglognlogn).WBIC. Unfortunately, performing the integral required to evaluate Fn is intractable. As an alternative, we can consider the Widely Applicable Bayesian Information Criterion (WBIC), which generalizes the Bayesian Information Criterion (BIC) to singular models,
WBIC:=Eβ∗w[nLn(w)]=∫nLn(w)pβ∗(w|Dn)dw,where pβ∗(w|Dn) is the tempered Bayesian posterior,
pβ(w|Dn)=p(Dn|w)βφ(w)pβ(Dn),sampled at inverse temperature β∗=1/logn (Watanabe, 2013). The two highest order terms in the expansion of the WBIC match that of the asymptotic expansion of Fn, which means we can substitute this in for Fn to estimate the learning coefficient:
^λ=Eβ∗w[nLn(w)]−nLn(w∗)logn.Expectations to empirical averages. To use this estimator, we need to first estimate the expectation value Eβ∗w[nLn(w)]. We can do this by sampling a set of weights Wβ∗={w1,…,wm|wi∼pβ∗(w|Dn)} and replacing the expectation value with an empirical average over the losses associated to those weights,
Eβ∗w[nLn(w)]≈1mm∑i=1nLn(wi).Unfortunately, estimating Eβ∗w[nLn(w)] in practice runs into significant curse-of-dimensionality effects for systems at scale.
The Local Learning Coefficient
From global to local. The problem becomes substantially more tractable if, instead of attempting to estimate the global learning coefficient, we estimate a local learning coefficient instead. We restrict our attention to a subset of weights and study how the volume scales in just that region. The scaling is dominated by the most complex local singularities.
As we saw in the aside on phase transitions, this local quantity is also more interesting from the perspective of dynamics. When actually training models, we don’t care about the behaviors of all possible models, we care about the behavior of a specific model as it moves through the loss landscape.
The local learning coefficient. To estimate the local learning coefficient, Lau et al. (2023) replace the prior φ(w) with a localizing prior centered at the point ^w∗ whose learning coefficient we want to estimate:
φ(w)→φ(w−^w∗).This ensures that samples drawn will remain “close” to the original point. Substituting this into the global ^λ estimator we saw earlier, we obtain the definition of the local learning coefficient,
^λ(^w∗)=Eβ∗w|^w∗[nLn(w)]−nLn(^w∗)logn,known colloquially as “lambda-hat,” where Ew|^w∗ indicates the use of a localizing prior.
To estimate this expectation value, we compute an empirical average over the losses of a set of samples {wi}mi=1, now drawn from the localized posterior:
p(w|Dn,^w∗)∝p(Dn|w)φ(w−^w∗).Given this empirical average, we plug it into the above formula for ^λ (subtracting n times the loss at ^w∗ and dividing by logn) to obtain an estimate for the local learning coefficient.
Implementing the Estimator
How to localize? The first practical question we run into when implementing the local learning coefficient estimator is how to localize. Lau et al. consider a Gaussian prior parametrized by a scale γ. This is not the only possible choice, but it is perhaps the most straightforward choice: a gentle quadratic restoring force.
How to sample? To draw samples from the posterior pβ∗(w|Dn,^w∗,γ), the natural thought is to use MCMC techniques. Explaining why these work is beyond the scope of this piece and is well covered elsewhere, but what’s important here is that standard MCMC methods run into the curse of dimensionality, which makes them unsuitable for larger models.
SGLD. To scale MCMC methods to higher dimensions, we need to incorporate information from gradients. This allows us to “preselect” low-loss, high-posterior regions of weight space. One of the simplest examples of such a technique is Stochastic Gradient Langevin Dynamics (SGLD), which is SGD with the addition of an explicit noise term η∼N(0,ϵ) (Welling & Teh 2011). This allows random movement along low loss dimensions, while quickly pushing any step to a higher loss area back down.
SGNHT and beyond. Adding momentum & friction would allow us to explore these ridges of the posterior more effectively. One such example is the Stochastic Gradient Nosé-Hoover Thermostat (SGNHT), which uses a thermostat for its friction term (Ding et al. 2014). Although clearly performing better in the toy potential above, it’s still unclear how well this approach generalizes to higher dimensions as these samplers can be harder to get working than SGLD.
Limitations
Engineering limitations. The above estimators do not yet live up to their full theoretical potential. To give an example, ^λ(w) depends on hyperparameters like the number of samples, the localizing strength γ, the SGLD learning rate ϵ, etc. The true coefficient λ(w) is independent of these hyperparameters.
We care less about whether ^λ is an unbiased estimator of λ and more about whether this estimator satisfies certain basic desiderata of model complexity. These include properties like ordinality: that λ(w1)≥λ(w2) implies ^λ(w1)≥^λ(w2). Improving this estimator and confirming that it satisfies these desiderata is currently bottlenecked more on engineering than on theory.
Theoretical limitations. At the same time, this estimator sometimes exceeds its theoretical underpinnings, yielding sensible results despite violating certain technical assumptions. This means there’s work to do for theorists in terms of finding assumptions to relax and studying, for example, how to extend the learning coefficient outside of non-minima. Currently, if not applied at a local minimum, the estimator can sometimes yield unphysical negative model complexities.[3]
There are limiting assumptions underlying the learning coefficient and additional limiting assumptions underlying the local learning coefficient. Both require further attention.
Proceed with caution. The current techniques for estimating ^λ are finicky to get working and should be applied with care. This is a research tool, not a plug-and-play observable printing press. Expect annoying hyperparameter tuning, (un)correlated estimates, weirdly high ^λ‘s, negative ^λ‘s, infinite ^λ’s, etc. This requires at least as much caution as MCMC in any other high-dimensional system: pay careful attention to diagnostics of chain health and vary hyperparameters to hunt for dependencies, systematic or otherwise.
What’s Next
Improving estimates. An immediate priority is to confirm that ^λ and λ match where it counts. Does it satisfy our desiderata? And can we calibrate ^λ in settings where we can analytically derive λ? There’s work to do on the theoretical front and on the computational front in terms of exploring more scalable and more accurate ways to sample from the local posterior beyond SGLD.
More observables. SLT also tells us how to measure other quantities of interest, such as the singular fluctuation. We’re also working on the second and third generations of techniques for studying transitions that extract vector and higher-order information about how models are changing.
Go forth and estimate! Although the current implementation is far from perfect, it should be good enough to start applying these techniques in practice. It’s as easy as
pip install devinterp
andfrom devinterp.slt import estimate_learning_coeff
(or at least, it’s easy up until that point). We recommend checking out these example notebooks for inspiration.If you want to learn more about the learning coefficient, see this post, this lecture, or this book. For updates on developmental interpretability, join the discord. Stay tuned for more updates soon!
Actually extending the minimum description length principle (‘choose the model that takes the fewest bits to specify’) to singular models is an open problem. The free energy formula (covered later) is not quite the same, though it fulfills the same purpose of formalizing Occam’s razor.
To give the Hessian some credit: its rank can give an us a lower bound on the model complexity. The Hessian still has its place. (EDIT: this originally said upper bound — that was wrong)
This occurs when the sampler strays beyond its intended confines and stumbles across models with much lower loss than those in the desired neighborhood.