a few thoughts on hyperparams for a better learning theory (for understanding what happens when a neural net is trained with gradient descent)
Having found myself repeating the same points/claims in various conversations about what NN learning is like (especially around singular learning theory), I figured it’s worth writing some of them down. My typical confidence in a claim below is like 95%[1]. I’m not claiming anything here is significantly novel. The claims/points:
local learning (eg gradient descent) strongly does not find global optima. insofar as running a local learning process from many seeds produces outputs with ‘similar’ (train or test) losses, that’s a law of large numbers phenomenon[2], not a consequence of always finding the optimal neural net weights.[3][4]
if your method can’t produce better weights: were you trying to produce better weights by running gradient descent from a bunch of different starting points? getting similar losses this way is a LLN phenomenon
maybe this is a crisp way to see a counterexample instead: train, then identify a ‘lottery ticket’ subnetwork after training like done in that literature. now get rid of all other edges in the network, and retrain that subnetwork either from the previous initialization or from a new initialization — i think this literature says that you get a much worse loss in the latter case. so training from a random initialization here gives a much worse loss than possible
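to make the law-of-large-numbers reading concrete, here's a minimal toy sketch in python. everything in it is an assumption made up for illustration, not something measured on real networks: there are `num_structures` learnable structures, each seed learns each one independently with probability `learn_prob`, and the final loss is just the fraction of structures that didn't get learned.

```python
import random
import statistics

def simulate_final_loss(num_structures, learn_prob, rng):
    """Toy final loss for one seed: the fraction of structures NOT learned.

    Hypothetical model: each structure is learned independently with
    probability learn_prob. The best achievable loss in this toy is 0.
    """
    learned = sum(rng.random() < learn_prob for _ in range(num_structures))
    return 1.0 - learned / num_structures

def loss_spread(num_structures, learn_prob=0.7, num_seeds=200):
    """Mean and standard deviation of the toy final loss across seeds."""
    losses = [
        simulate_final_loss(num_structures, learn_prob, random.Random(seed))
        for seed in range(num_seeds)
    ]
    return statistics.mean(losses), statistics.stdev(losses)

if __name__ == "__main__":
    for n in (10, 100, 10_000):
        mean, std = loss_spread(n)
        print(f"structures={n:>6}  mean loss={mean:.3f}  std across seeds={std:.4f}")
```

in this toy, the spread across seeds shrinks like 1/sqrt(num_structures) while the mean loss stays near 1 - learn_prob, well above the best achievable loss of 0: seeds agree with each other without any of them being near optimal.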
dynamics (kinetics) matter(s). the probability of getting to a particular training endpoint is highly dependent not just on stuff that is evident from the neighborhood of that point, but on there being a way to make those structures incrementally, ie by a sequence of local moves each of which is individually useful.[5][6][7] i think that this is not an academic correction, but a major one — the structures found in practice are very massively those with sensible paths into them and not other (naively) similarly complex structures. some stuff to consider:
the human eye evolving via a bunch of individually sensible steps, https://en.wikipedia.org/wiki/Evolution_of_the_eye
(given a toy setup and in a certain limit,) the hardness of learning a boolean function being characterized by its leap complexity, ie the size of the ‘largest step’ between its fourier terms, https://arxiv.org/pdf/2302.11055
imagine a loss function on a plane which has a crater somewhere and another crater with a valley descending into it somewhere else. the local neighborhoods of the deepest points of the two craters can look the same, but the crater with a valley descending into it will have a massively larger drainage basin. to say more: the crater with a valley is a case where it is first loss-decreasing to build one simple thing, (ie in this case to fix the value of one parameter), and once you’ve done that loss-decreasing to build another simple thing (ie in this case to fix the value of another parameter); getting to the isolated crater is more like having to build two things at once. i think that with a reasonable way to make things precise, the drainage basin of a ‘k-parameter structure’ with no valley descending into it will be exponentially smaller than that of eg a ‘k-parameter structure’ with ‘a k/2-parameter valley’ descending into it, which will be exponentially smaller still than a ‘k-parameter structure’ with a sequence of valleys of slowly increasing dimension descending into it
it seems plausible to me that the right way to think about stuff will end up revealing that in practice there are basically only systems of steps where a single [very small thing]/parameter gets developed/fixed at a time
i’m further guessing that most structures basically have ‘one way’ to descend into them (tho if you consider sufficiently different structures to be the same, then this can be false, like in examples of convergent evolution) and that it’s nice to think of the probability of finding the structure as the product over steps of the probability of making the right choice on that step (of falling in the right part of a partition determining which next thing gets built)
one correction/addition to the above is that it’s probably good to see things in terms of there being many ‘independent’ structures/circuits being formed in parallel, creating some kind of ecology of different structures/circuits. maybe it makes sense to track the ‘effective loss’ created for a structure/circuit by the global loss (typically including weight norm) together with the other structures present at a time? (or can other structures do sufficiently orthogonal things that it’s fine to ignore this correction in some cases?) maybe it’s possible to have structures which were initially independent be combined into larger structures?[8]
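a rough numerical sketch of the two-crater picture, assuming a made-up 2d loss: two gaussian craters of equal depth and width, one of which also sits at the end of a trough that is narrow in y and long in x (so you first fix y, then x). all constants, names, the box of starting points, and the 'reached the crater' criterion are hypothetical choices. in 2d the basin gap is only a modest factor, so this illustrates the direction of the claim, not the exponential scaling.

```python
import math
import random

# Hypothetical toy landscape; every constant here is made up for illustration.
ISOLATED = (-2.0, 2.0)   # crater with no valley leading into it
VALLEYED = (2.0, -2.0)   # crater with a long valley leading into it
DEPTH, WIDTH = 1.0, 0.4  # both craters share the same depth and width
VALLEY_DEPTH, VALLEY_LENGTH = 0.5, 3.0

def grad(x, y):
    """Gradient of loss = -(crater at ISOLATED) - (crater at VALLEYED) - (valley)."""
    gx = gy = 0.0
    for cx, cy in (ISOLATED, VALLEYED):
        dx, dy = x - cx, y - cy
        bump = DEPTH * math.exp(-(dx * dx + dy * dy) / (2 * WIDTH**2))
        gx += bump * dx / WIDTH**2
        gy += bump * dy / WIDTH**2
    # Valley: narrow in y (fix y first), long and gently sloped in x (then fix x).
    dx, dy = x - VALLEYED[0], y - VALLEYED[1]
    trough = VALLEY_DEPTH * math.exp(
        -dy * dy / (2 * WIDTH**2) - dx * dx / (2 * VALLEY_LENGTH**2)
    )
    gx += trough * dx / VALLEY_LENGTH**2
    gy += trough * dy / WIDTH**2
    return gx, gy

def run_gd(x, y, lr=0.1, steps=2000):
    """Plain gradient descent from (x, y) for a fixed number of steps."""
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y

def basin_fractions(num_starts=200, box=4.0, seed=0):
    """Fraction of uniform random starts in [-box, box]^2 ending near each crater."""
    rng = random.Random(seed)
    counts = {"isolated": 0, "valleyed": 0, "neither": 0}
    for _ in range(num_starts):
        end = run_gd(rng.uniform(-box, box), rng.uniform(-box, box))
        if math.dist(end, ISOLATED) < WIDTH:
            counts["isolated"] += 1
        elif math.dist(end, VALLEYED) < WIDTH:
            counts["valleyed"] += 1
        else:
            counts["neither"] += 1  # stalled on the flat part of the plane
    return {name: count / num_starts for name, count in counts.items()}

if __name__ == "__main__":
    print(basin_fractions())
```

`basin_fractions()` returns the fraction of random starts that end within `WIDTH` of each crater after a fixed number of gradient steps; starts that stall on the flat part of the plane are counted as `neither`.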
everything is a loss phenomenon. if something is ever a something-else phenomenon, that’s logically downstream of a relation between that other thing and loss (but this isn’t to say you shouldn’t be trying to find these other nice things related to loss)
grokking happens basically only in the presence of weight regularization, and it has to do with there being slower structures to form which are eventually more efficient at making logits high (ie more logit bang for weight norm buck)
in the usual case that generalization starts to happen immediately, this has to do with generalizing structures being stronger attractors even at initialization. one consideration at play here is that
nothing interesting ever happens during a random walk on a loss min surface
it’s not clear that i’m conceiving of structures/circuits correctly/well in the above. i think it would help to have a library of like >10 well-understood toy models (as opposed to like the maybe 1.3 we have now), and to be very closely guided by them when developing an understanding of neural net learning
some related (more meta) thoughts
to do interesting/useful work in learning theory (as of 2024), imo it matters a lot that you think hard about phenomena of interest and try to build theory which lets you make sense of them, as opposed to holding fast to an existing formalism and trying to develop it further / articulate it better / see phenomena in terms of it
this is somewhat downstream of current formalisms imo being bad, it imo being appropriate to think of them more as capturing preliminary toy cases, not as revealing profound things about the phenomena of interest, and imo it being feasible to do better
but what makes sense to do can depend on the person, and it’s also fine to just want to do math lol
and it’s certainly very helpful to know a bunch of math, because that gives you a library in terms of which to build an understanding of phenomena
it’s imo especially great if you’re picking phenomena to be interested in with the future going well around ai in mind
(* but it looks to me like learning theory is unfortunately hard to make relevant to ai alignment[9])
acknowledgments
these thoughts are sorta joint with Jake Mendel and Dmitry Vaintrob (though i’m making no claim about whether they’d endorse the claims). also thank u for discussions: Sam Eisenstat, Clem von Stengel, Lucius Bushnaq, Zach Furman, Alexander Gietelink Oldenziel, Kirke Joamets
[1] with the important caveat that, especially for claims involving ‘circuits’/‘structures’, I think it’s plausible they are made in a frame which will soon be superseded or at least significantly improved/clarified/better-articulated, so it’s a 95% given a frame which is probably silly
[2] train loss in very overparametrized cases is an exception. in this case it might be interesting to note that optima will also be off at infinity if you’re using cross-entropy loss, https://arxiv.org/pdf/2006.06657
[3] also, gradient descent is very far from doing optimal learning in some solomonoff sense (though it can be fruitful to try to draw analogies between the two), and it is also very far from being the best possible practical learning algorithm
[4] by it being a law of large numbers phenomenon, i mean sth like: there are a bunch of structures/circuits/pattern-completers that could be learned, and each one gets learned with a certain probability (or maybe a roughly given total number of these structures gets learned), and loss is roughly some aggregation of indicators for whether each structure gets learned, an aggregation to which the law of large numbers applies
[5] to say more: any concept/thinking-structure in general has to be invented somehow (there in some sense has to be a ‘sensible path’ to that concept), but any local learning process is much more limited than that still: now we’re forced to have a path in some (naively seen) space of possible concepts/thinking-structures, which is a major restriction. eg you might find the right definition in mathematics by looking for a thing satisfying certain constraints (eg you might want the definition to fit into theorems characterizing something you want to characterize), and many such definitions will not be findable by doing sth like gradient descent on definitions
[6] ok, (given an architecture and a loss,) technically each point in the loss landscape will in fact have a different local neighborhood, so in some sense we know that the probability of getting to a point is a function of its neighborhood alone, but what i’m claiming is that it is not nicely/usefully a function of its neighborhood alone. to the extent that stuff about this probability can be nicely deduced from some aspect of the neighborhood, that’s probably ‘logically downstream’ of that aspect of the neighborhood implying something about nice paths to the point.
[7] also note that the points one ends up at in LLM training are not local minima; LLMs aren’t trained to convergence
[8] i think identifying and very clearly understanding any toy example where this shows up would plausibly be better than anything else published in interp this year. the leap complexity paper does something a bit like this but doesn’t really do this
[9] i feel like i should clarify here though that i think basically all existing alignment research fails to relate much to ai alignment. but then i feel like i should further clarify that i think each particular thing sucks at relating to alignment after having thought about how that particular thing could help, not (directly) from some general vague sense of pessimism. i should also say that if i didn’t think interp sucked at relating to alignment, i’d think learning theory sucks less at relating to alignment (ie, not less than interp but less than i currently think it does). but then i feel like i should further say that fortunately you can just think about whether learning theory relates to alignment directly yourself :)
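a small check of the footnote above about cross-entropy optima being off at infinity, on a made-up linearly separable 1-d dataset (the data and the one-parameter classifier are assumptions for illustration, not taken from the linked paper): scaling up a separating weight keeps lowering the loss, so no finite weight is optimal.

```python
import math

# Made-up, linearly separable 1-d dataset: the label is the sign of x.
DATA = [(-2.0, -1), (-1.0, -1), (0.5, 1), (3.0, 1)]

def logistic_loss(w):
    """Mean cross-entropy (logistic) loss of the classifier sign(w * x)."""
    return sum(math.log1p(math.exp(-y * w * x)) for x, y in DATA) / len(DATA)

# Any w > 0 separates the data; making it larger only shrinks the loss.
scales = [1, 2, 4, 8, 16, 32]
losses = [logistic_loss(w) for w in scales]

for w, loss in zip(scales, losses):
    print(f"w={w:>2}  loss={loss:.3e}")
```

the loss stays strictly positive at every finite `w` and keeps shrinking as `w` grows, which is the sense in which the optimum is 'at infinity' here.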
Simon-Pepin Lehalleur weighs in on the DevInterp Discord:
I think his overall position requires taking degeneracies seriously: he seems to be claiming that there is a lot of path dependency in weight space, but very little in function space 😄
In general his position seems broadly compatible with DevInterp:
models learn circuits/algorithmic structure incrementally
the development of structures is controlled by loss landscape geometry
and also possibly in more complicated cases by the landscapes of “effective losses” corresponding to subcircuits...
This perspective certainly is incompatible with a naive SGD = Bayes = Watanabe’s global SLT learning process, but I don’t think anyone has (ever? for a long time?) made that claim for non toy models.
It seems that the difference with DevInterp is that
we are more optimistic that it is possible to understand which geometric observables of the landscape control the incremental development of circuits
we expect, based on local SLT considerations, that those observables have to do with the singularity theory of the loss and also of sub/effective losses, with the LLC being the most important but not the only one
we dream that it is possible to bootstrap this to a full fledged S4 correspondence, or at least to get as close as we can.
Ok, no problem. You can also add the following:
I am sympathetic but also unsatisfied with a strong empiricist position about deep learning. It seems to me that it is based on a slightly misapplied physical, and specifically thermodynamical, intuition. Namely that we can just observe a neural network and see/easily guess what the relevant “thermodynamic variables” of the system are.
For ordinary 3d physical systems, we tend to know or easily discover those thermodynamic variables through simple interactions/observations. But a neural network is an extremely high-dimensional system which we can only “observe” through mathematical tools. The loss is clearly one such thermodynamic variable, but if we expect NN to be in some sense stat mech systems it can’t be the only one (otherwise the learning process would be much more chaotic and unpredictable). One view of DevInterp is that we are “just” looking for those missing variables...
I’d be curious about hearing your intuition re “i’m further guessing that most structures basically have ‘one way’ to descend into them”