First of all, I really like the images; they made things easier to understand and are pretty. Good work with that!
My biggest problem with this is the unclear applicability of this to alignment. Why do we want to predict scaling laws?
Doesn’t that mostly promote AI capabilities, and not alignment very much?
Second, I feel like there’s some confusion going on between several probability distributions and potential functions:
The singularities are those of the likelihood ratio Kn(w)
We care about the generalization error Gn with respect to some prior ϕ(w), but the latter doesn’t have any effect on the dynamics of SGD or on what the singularity is
The Watanabe limit (Fn as n→∞) and the restricted free energy Fn(W(i)) are both presented as results which rely on the singularities and somehow predict generalization. But both of these depend on the prior ϕ, and earlier we defined the singularities to be those of the likelihood function; plus, SGD actually only uses the likelihood function for its dynamics.
What is going on here?
It’s also unclear what the takeaway from this post is. How can we predict generalization or dynamics from these things?
Are there any empirical results on this?
Some clarifying questions / possible mistakes:
Kn(w) is not a KL divergence; the terms of the sum should be multiplied by q(x) or p(x|w).
the Hamiltonian is a random process given by the log likelihood ratio function
Also given by the prior, if we go by the equation just above that. Also where does “ratio” come from? We can find likelihood ratios in the Metropolis-Hastings transition probabilities, but you didn’t even mention those here. I’m confused.
But that just gives us the KL divergence.
I’m not sure where you get this. Is it from the fact that predicting p(x | w) = q(x) is optimal, because the actual probability of a data point is q(x)? If not, it’d be nice to specify.
the minima of the term in the exponent, K(w), are equal to 0.
This is only true for the global minima, but for the dynamics of learning we also care about local minima (that may be higher than 0). Are we implicitly assuming that most local minima are also global? Is this true of actual NNs?
the asymptotic form of the free energy as n→∞
This is only true when the weights w0 are close to the singularity, right? Also, what is λ? It seems like it’s the RLCT, but this isn’t stated.
First of all, I really like the images; they made things easier to understand and are pretty. Good work with that!
Thank you!
My biggest problem with this is the unclear applicability of this to alignment. Why do we want to predict scaling laws? Doesn’t that mostly promote AI capabilities, and not alignment very much?
This is also my biggest source of uncertainty on the whole agenda. There’s definitely a capabilities risk, but I think the benefits to understanding NNs currently much outweigh the benefits to improving them.
In particular, I think that understanding generalization is pretty key to making sense of outer and inner alignment. If “singularities = generalization” holds up, then our task seems to become quite a lot easier: we only have to understand a few isolated points of the loss landscape instead of the full exponential hell that is a billions-dimensional system.
In a similar vein, I think that this is one of the most promising paths to understanding what’s going on during training. When we talk about phase changes / sharp left turns / etc., what we may really be talking about are discrete changes in the local singularity structure of the loss landscape. Understanding singularities seems key to predicting and anticipating these changes just as understanding critical points is key to predicting and anticipating phase transitions in physical systems.
We care about the generalization error Gn with respect to some prior ϕ(w), but the latter doesn’t have any effect on the dynamics of SGD or on what the singularity is
The Watanabe limit (Fn as n→∞) and the restricted free energy Fn(W(i)) are both presented as results which rely on the singularities and somehow predict generalization. But both of these depend on the prior ϕ, and earlier we defined the singularities to be those of the likelihood function; plus, SGD actually only uses the likelihood function for its dynamics.
As long as your prior has non-zero support on the singularities, the results hold up (because we’re taking this large-N limit where the prior becomes less important). Like I mention in the objections, linking this to SGD is going to require more work. To first order, when your prior has support over only a compact subset of weight space, your behavior is dominated by the singularities in that set (this is another way to view the comments on phase transitions).
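For concreteness, the expansion in question (Watanabe’s free energy asymptotics, in its standard form, with λ the RLCT and m its multiplicity) is

$$F_n = nL_n(w_0) + \lambda \log n - (m-1)\log\log n + O_p(1),$$

where w0 is an optimal parameter; the prior ϕ only shows up in the lower-order terms, provided it is positive around the relevant singularities.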
It’s also unclear what the takeaway from this post is. How can we predict generalization or dynamics from these things? Are there any empirical results on this?
This is very much a work in progress.
In statistical physics, much of our analysis is built on the assumption that we can replace temporal averages with phase-space averages. This is justified on grounds of the ergodic hypothesis. In singular learning theory, we’ve jumped to parameter (phase)-space averages without doing the important translation work from training (temporal) averages. SGD is not ergodic, so this will require care. That the exact asymptotic forms may look different in the case of SGD seems probable. That the asymptotic forms for SGD make no reference to singularities seems unlikely. The basic takeaway is that singularities matter disproportionately, and if we’re going to try to develop a theory of DNNs, they will likely form an important component.
For (early) empirical results, I’d check out the theses mentioned here.
Kn(w) is not a KL divergence; the terms of the sum should be multiplied by q(x) or p(x|w).
Kn(w) is an empirical KL divergence. It’s multiplied by the empirical distribution, qn(x), which just puts 1/n probability on the observed samples (and 0 elsewhere),
$$q_n(x) := \frac{1}{n}\sum_{i=1}^{n} \delta(x - x_i).$$
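Spelling that substitution out: weighting the log ratio by qn and integrating collapses everything onto the observed samples,

$$K_n(w) = \int q_n(x)\,\log\frac{q(x)}{p(x\mid w)}\,dx = \frac{1}{n}\sum_{i=1}^{n}\log\frac{q(X_i)}{p(X_i\mid w)}.$$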
the Hamiltonian is a random process given by the log likelihood ratio function
Also given by the prior, if we go by the equation just above that. Also where does “ratio” come from?
Yes, also the prior, thanks for the correction. The ratio comes from doing the normalization (“log likelihood ratio” is just another one of Watanabe’s names for the empirical KL divergence). In the following definition,
$$K_n(w) = L^0_n(w) := L_n(w) - S_n = \frac{1}{n}\sum_{i=1}^{n} \log\frac{q(X_i)}{p(X_i \mid w)},$$
the likelihood ratio is
$$\frac{p(X_i \mid w)}{q(X_i)}.$$
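For readers keeping track of the notation, Ln and Sn here are (in Watanabe’s standard usage) the empirical negative log likelihood and the empirical entropy,

$$L_n(w) = -\frac{1}{n}\sum_{i=1}^{n}\log p(X_i\mid w), \qquad S_n = -\frac{1}{n}\sum_{i=1}^{n}\log q(X_i),$$

so subtracting them leaves exactly the averaged log likelihood ratio above.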
But that just gives us the KL divergence.
I’m not sure where you get this. Is it from the fact that predicting p(x | w) = q(x) is optimal, because the actual probability of a data point is q(x)? If not, it’d be nice to specify.
the minima of the term in the exponent, K(w), are equal to 0.
This is only true for the global minima, but for the dynamics of learning we also care about local minima (that may be higher than 0). Are we implicitly assuming that most local minima are also global? Is this true of actual NNs?
This is the comment in footnote 3. Like you say, it relies on the assumption of realizability (the global minimum of Kn(w) actually being 0), which is not very realistic! As I point out in the objections, we can sometimes fix this, but not always (yet).
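To make the assumption explicit (a standard statement of realizability, spelled out for reference): there is some true parameter w0 with

$$p(x\mid w_0) = q(x)\ \text{for all } x, \qquad\text{equivalently}\qquad \min_w K(w) = 0.$$

Without it, K(w) bottoms out above zero and the asymptotics need the kind of adjustments alluded to here.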
the asymptotic form of the free energy as n→∞
This is only true when the weights w0 are close to the singularity, right?
That’s the crazy thing. You do the integral over all the weights to get the model evidence, and it’s totally dominated by just these few weights. Again, when we’re making the change to SGD, this probably changes.
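As an illustration of how dominant the singularities are, here is a minimal numerical sketch (my own toy example; the model p(x|a,b) = N(ab, 1), the uniform prior on [-2, 2]^2, and the grid quadrature are all assumptions chosen for simplicity, not anything from the post). The true distribution is q = N(0, 1), the set of true parameters is the pair of axes ab = 0 with a singular crossing at the origin, and the normalized evidence should grow roughly like λ log n with λ = 1/2 instead of the d/2 = 1 a regular two-parameter model would give.

```python
# Toy check that the Bayesian evidence of a singular model is dominated by its
# singularities: estimate F_n - n*S_n = -log int exp(-n K_n(a,b)) phi(a,b) da db
# by brute-force grid quadrature and watch how it grows with log n.
import numpy as np

rng = np.random.default_rng(0)

# Uniform prior on [-2, 2]^2, discretized on a grid.
grid = np.linspace(-2.0, 2.0, 801)
A, B = np.meshgrid(grid, grid)
AB = A * B
log_prior = -np.log(4.0 * 4.0)        # constant density 1/16 on the square
cell = (grid[1] - grid[0]) ** 2       # area element for the Riemann sum

def normalized_free_energy(n, reps=20):
    """Average of -log int exp(-n K_n) phi dw over `reps` fresh datasets of size n."""
    vals = []
    for _ in range(reps):
        xbar = rng.normal(size=n).mean()   # only the sample mean enters K_n here
        Kn = 0.5 * AB**2 - xbar * AB       # empirical KL for p(x|a,b)=N(ab,1), q=N(0,1)
        log_integrand = -n * Kn + log_prior
        m = log_integrand.max()            # log-sum-exp for numerical stability
        log_Z = m + np.log(np.exp(log_integrand - m).sum() * cell)
        vals.append(-log_Z)
    return float(np.mean(vals))

ns = np.array([50, 100, 200, 400, 800, 1600])
F0 = np.array([normalized_free_energy(n) for n in ns])

# Slope of the normalized free energy against log n approximates the RLCT lambda.
slope = np.polyfit(np.log(ns), F0, 1)[0]
print("fitted slope:", round(slope, 2), "(singular theory: lambda = 1/2; regular Laplace/BIC: d/2 = 1)")
```

At these sample sizes the fitted slope will typically come out somewhat below 1/2 (the subtracted log log n term drags it down), but well below 1, which is the point: the integral is controlled by the neighborhood of the singular set, not by a quadratic basin.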
Also, what is λ? It seems like it’s the RLCT, but this isn’t stated.
Yes, I’ve made an edit. Thanks!
Let me add some more views on SLT and capabilities/alignment. Quoting Dan Murfet:
(Dan Murfet’s personal views here.) First, some caveats: although we are optimistic SLT can be developed into a theory of deep learning, it is not currently such a theory and it remains possible that there are fundamental obstacles. Putting those aside for a moment, it is plausible that phenomena like scaling laws and the related emergence of capabilities like in-context learning can be understood from first principles within a framework like SLT. This could contribute both to capabilities research and safety research.
Contribution to capabilities. Right now it is not understood why Transformers obey scaling laws, and how capabilities like in-context learning relate to scaling in the loss; improved theoretical understanding could increase scaling exponents or allow them to be engineered in smaller systems. For example, some empirical work already suggests that certain data distributions lead to in-context learning. It is possible that theoretical work could inspire new ideas. Thermodynamics wasn’t necessary to build steam engines, but it helped to push the technology to new levels of capability once the systems became too big and complex for tinkering.
Contribution to alignment. Broadly speaking, it is hard to align what you do not understand. Either the aspects of intelligence relevant for alignment are universal, or they are not. If they are not, we have to get lucky (and stay lucky as the systems scale). If the relevant aspects are universal (in the sense that they arise for fundamental reasons in sufficiently intelligent systems across multiple different substrates), we can try to understand them and use them to control/align advanced systems (or at least be forewarned about their dangers) and be reasonably certain that the phenomena continue to behave as predicted across scales. This is one motivation behind the work on properties of optimal agents, such as Logical Inductors. SLT is a theory of universal aspects of learning machines; it could perhaps be developed in similar directions.
Does understanding scaling laws contribute to safety? It depends on what is causing scaling laws. If, as we suspect, it is about phases and phase transitions, then it is related to the nature of the structures that emerge during training which are responsible for these phase transitions (e.g. concepts). A theory of interpretability scalable enough to align advanced systems may need to develop a fundamental theory of abstractions, especially if these are related to the phenomena around scaling laws and emergent capabilities.
Our take on this has been partly spelled out in the Abstraction seminar. We’re trying to develop existing links in mathematical physics between renormalisation group flow and resolution of singularities, which, applied in the context of SLT, might give a fundamental understanding of how abstractions emerge in learning machines. One best-case scenario of the application of SLT to alignment is that this line of research gives a theoretical framework in which to understand more empirical interpretability work.
The utility of theory in general and SLT in particular depends on your mental model of the problem landscape between here and AGI. To return to the thermodynamics analogy: a working theory of thermodynamics isn’t necessary to build train engines, but it’s probably necessary to build rockets. If you think the engineering-driven approach that has driven deep learning so far will plateau before AGI, probably theory research is bad in expected value. If you think theory isn’t necessary to get to AGI, then it may be a risk that we have to take.
Summary: In my view we seem to know enough to get to AGI. We do not know enough to get to alignment. Ergo we have to take some risks.