(I’m again not an SLT expert, and hence one shouldn’t assume I’m able to give the strongest arguments for it. But I feel like this comment deserves some response, so:)
I find the examples of empirical work you give uncompelling because they were all cases where we could have answered all the relevant questions using empirics, and they aren’t analogous to a case where we can’t just check empirically.
I basically agree that SLT hasn’t yet provided deep concrete information about a real trained ML model that we couldn’t have obtained via other means. I think this isn’t as bad as (I think) you imply, though. Some reasons:
SLT/devinterp makes the claim that training consists of discrete phases, and the Developmental Landscape paper validates this. Very likely we could have determined this for the 3M transformer via other means, but:
My (not confident) impression is that, a priori, people didn’t expect this discrete-phases thing to hold, except for those who expected it because of SLT.
(Plausibly there’s tacit knowledge in this direction in the mech interp field that I don’t know of; correct me if this is the case.)
The SLT approach here is conceptually simple and principled, in a way that seems like it could scale, in contrast to “using empirics”.
I currently view the empirical work as validating that the theoretical ideas actually work in practice, not as providing ready insight into models.
Of course, you don’t want to tinker forever with toy cases; you should actually demonstrate your value by doing something no one else can, etc.
I’d be very sympathetic to criticism about not-providing-substantial-new-insight-that-we-couldn’t-easily-have-obtained-otherwise 2-3 years from now; but for now I’m leaning towards giving the field time to mature.
For the case of the paper looking at a small transformer (and when various abilities emerge), we can just check when a given model is good at various things across training if we want to know that. And, separately, I don’t see a reason why knowing what a transformer is good at in this way is that useful.
My sense is that SLT is supposed to give you deeper knowledge than what you get by simply checking the model’s behavior (or to give that knowledge more scalably). I don’t have a great picture of this myself, and am somewhat skeptical of its feasibility. I’ve e.g. heard talk of quantifying generalization via the learning coefficient, and while understanding the extent to which models generalize seems great, I’m not sure how one beats behavioral evaluations here.
Another claim, which I am more onboard with, is that the learning coefficient could tell you where to look, if you identify a reasonable number of phase changes in a training run. (I’ve heard some talk of also looking at the learning coefficient w.r.t. a subset of weights, or a subset of data, to get more fine-grained information.) I feel like this has value.
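To make the learning-coefficient talk here a bit more concrete: as I understand it, the local learning coefficient at a checkpoint w* is typically estimated by sampling (via SGLD) from a posterior that is tempered with inverse temperature beta = 1/log(n) and localized around w*, and then comparing the expected loss under those samples to the loss at w*. Below is a minimal sketch of that computation; the function names, hyperparameters, and the restrict_to option (a crude stand-in for “learning coefficient w.r.t. a subset of weights”) are my own illustrative choices, not code from the paper or any existing library.

```python
# Rough sketch of a local learning coefficient (LLC) estimate at a checkpoint w*:
#   lambda_hat = n * beta * (E_posterior[L_n(w)] - L_n(w*)),  beta = 1 / log(n),
# where the expectation is over SGLD samples from a posterior tempered by beta
# and localized around w* by a quadratic penalty. All hyperparameters are
# illustrative, not a recommended recipe.
import copy
import math
import torch

def mean_loss(model, loss_fn, loader):
    # Average loss over the (sub)dataset; used for L_n(w*).
    total, batches = 0.0, 0
    with torch.no_grad():
        for x, y in loader:
            total += loss_fn(model(x), y).item()
            batches += 1
    return total / batches

def estimate_llc(model, loss_fn, loader, n, steps=1000,
                 eps=1e-4, gamma=100.0, restrict_to=None):
    # restrict_to: optional set of parameter names; if given, only those weights
    # are perturbed (a crude "learning coefficient w.r.t. a subset of weights").
    # Passing a loader over a data subset gives the per-data variant.
    beta = 1.0 / math.log(n)
    loss_at_w_star = mean_loss(model, loss_fn, loader)

    sampler = copy.deepcopy(model)
    w_star = {k: v.detach().clone() for k, v in sampler.named_parameters()}

    batch_losses = []
    data_iter = iter(loader)
    for _ in range(steps):
        try:
            x, y = next(data_iter)
        except StopIteration:
            data_iter = iter(loader)
            x, y = next(data_iter)

        loss = loss_fn(sampler(x), y)
        sampler.zero_grad()
        loss.backward()
        batch_losses.append(loss.item())

        with torch.no_grad():
            for name, p in sampler.named_parameters():
                if restrict_to is not None and name not in restrict_to:
                    continue
                # SGLD step on the localized, tempered posterior.
                drift = -0.5 * eps * (n * beta * p.grad
                                      + gamma * (p - w_star[name]))
                p.add_(drift + torch.randn_like(p) * math.sqrt(eps))

    expected_loss = sum(batch_losses) / len(batch_losses)
    return n * beta * (expected_loss - loss_at_w_star)
```

A real estimate would add burn-in, multiple chains, and careful tuning of eps and gamma; the point here is just the shape of the computation.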
Alice: Ok, I have some thoughts on the detecting/classifying phase transitions application. Surely during the interesting part of training, phase transitions aren’t at all localized and are just constantly going on everywhere? So, you’ll already need some way of cutting the model into parts such that phase transitions cleave these parts nicely. Why think such a decomposition exists? Also, shouldn’t you just expect that many/most “phase transitions” are occurring over a reasonably high fraction of training? (After all, performance is often the average of many, many sigmoids.)
If I put on my SLT goggles, I think most phase transitions do not occur over a high fraction of training, but instead happen over relatively few SGD steps.
I’m not sure what Alice means by “phase transitions [...] are just constantly going on everywhere”. But: probably it makes sense to think that somewhat different “parts” of the model are affected by training on GitHub vs. Harry Potter fanfiction, and one would want a theory of phase changes to be able to deal with that. (Cf. the talk about learning coefficients for subsets of weights/data above.) I don’t have strong arguments for expecting this to be feasible.
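And to connect this back to the “tell you where to look” claim: the workflow I have in mind is estimating the learning coefficient at a series of saved checkpoints and treating sharp changes as candidate phase transitions worth inspecting. A minimal sketch, reusing the hypothetical estimate_llc from the earlier snippet; the jump-threshold heuristic and checkpoint handling are made up for illustration, and the actual stage-detection procedure used in the paper is presumably more careful than this.

```python
import torch

def find_candidate_transitions(checkpoint_paths, make_model, loss_fn, loader,
                               n, jump_threshold=0.2):
    # Estimate the LLC at each saved checkpoint and flag intervals where it
    # changes sharply relative to the previous checkpoint. The flagged spans
    # are the "where to look" candidates; nothing here says *what* changed.
    llcs = []
    for path in checkpoint_paths:
        model = make_model()
        model.load_state_dict(torch.load(path))
        model.eval()
        llcs.append(estimate_llc(model, loss_fn, loader, n))

    candidates = []
    for i in range(1, len(llcs)):
        rel_change = abs(llcs[i] - llcs[i - 1]) / (abs(llcs[i - 1]) + 1e-8)
        if rel_change > jump_threshold:
            candidates.append((checkpoint_paths[i - 1], checkpoint_paths[i]))
    return llcs, candidates
```

The output is just a list of checkpoint intervals to zoom in on with other (e.g. mech interp or behavioral) tools.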
discrete phases, and the Developmental Landscape paper validates this
Hmm, the phases seem only roughly discrete, and I think a perspective like the multi-component learning perspective totally explains these results, makes stronger predictions, and seems easier to reason about (at least for me).
I would say something like:
The empirical results in the paper indicate that, for a tiny (3M) transformer with learned positional embeddings:
The model initially doesn’t use positional embeddings and doesn’t know common 2-4 grams. So, it probably is basically just learning bigrams to start.
Later, positional embeddings become useful and steadily get more useful over time. At the same time, it learns common 2-4 grams (among other things). (This is now possible as it has positional embeddings.)
Later, the model learns a head which almost entirely attends to the previous token. At the same time, the ICL score goes down and the model learns heads which do something like induction (while probably also doing a bunch of other stuff). (It also learns a bunch of other things at the same point.)
So, I would say the results are “several capabilities of tiny LLMs require other components, so you see phases (aka s-shaped loss curves) based on when these other components come into play”. (Again, see multi-component learning and s-shaped loss curves which makes this exact prediction.)
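For what it’s worth, the “many sigmoids” point from Alice’s comment is easy to see numerically: a few large component-gated capabilities with distinct onsets give a visibly staged loss curve, while many small capabilities with spread-out onsets average into something smooth. A toy illustration (all numbers made up):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

steps = np.logspace(2, 5, 400)   # training steps, log-spaced
log_t = np.log10(steps)

def loss_curve(onsets, sizes, sharpness=8.0):
    # Each "capability" contributes a sigmoid drop in loss once its
    # prerequisite component comes into play at its onset (in log-steps).
    loss = np.full_like(log_t, sum(sizes))
    for onset, size in zip(onsets, sizes):
        loss -= size * sigmoid(sharpness * (log_t - onset))
    return loss

# A few large components -> visibly staged ("phase"-like) loss curve.
staged = loss_curve(onsets=[2.5, 3.5, 4.5], sizes=[1.0, 1.0, 1.0])

# Many small components with spread-out onsets -> smooth loss curve.
rng = np.random.default_rng(0)
smooth = loss_curve(onsets=rng.uniform(2.2, 4.8, 200),
                    sizes=np.full(200, 3.0 / 200))
```

Plotting staged and smooth against log_t shows the contrast: the first looks like roughly discrete phases, the second like the messier average for which “phase transition” is a less natural description.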
My (not confident) impression is that, a priori, people didn’t expect this discrete-phases thing to hold
I mean, it will depend on how a priori you mean. I again think that the perspective in multi-component learning and s-shaped loss curves explains what is going on. This was inspired by various empirical results (e.g. results around an s-shape in induction-like-head formation).
but for now I’m leaning towards giving the field time to mature
Seems fine to give the field time to mature. That said, if there isn’t a theory of change better than “it seems good to generally understand how NN learning works from a theory perspective” (which I’m not yet sold on) or more compelling empirical demos, I don’t think this is super compelling. I think it seems worthwhile for some people with high comparative advantage to work on this, but it’s not a great pitch. (The current level of relative investment seems maybe a bit high to me but not crazy. That said, idk.)
Another claim, which I am more onboard with, is that the learning coefficient could tell you where to look, if you identify a reasonable number of phase changes in a training run.
I don’t expect things to localize interestingly for the behaviors we really care about. As in, I expect that the behaviors we care about are learned diffusely across a high fraction of parameters and are learned in a way which either isn’t well described as a phase transition or which involves a huge number of tiny phase transitions of varying size which average out into something messier.
(And getting the details right will be important! I don’t think it will be fine to get 1⁄3 of the effect size if you want to understand things well enough to be useful.)
I think most phase transitions do not occur over a high fraction of training, but instead happen over relatively few SGD steps.
All known phase transitions[1] seem to happen across a reasonably high (>5%?) fraction of log-training steps.[2]
[1] More precisely, “things which seem sort of like phase transitions” (e.g. s-shaped loss curves). I don’t know if these are really phase transitions for some more precise definition.
[2] Putting aside pathological training runs like training a really tiny model (e.g. 3 million params) on 10^20 tokens or something.
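To make the “>5% of log-training steps” framing concrete, here is the arithmetic I have in mind (my own illustration, not from the thread): measure a transition’s extent as the fraction of the run it occupies on a log-step axis.

```python
import math

def log_step_fraction(t_start, t_end, first_step, last_step):
    # Fraction of the run, measured on a log axis, that a transition spans.
    return ((math.log(t_end) - math.log(t_start))
            / (math.log(last_step) - math.log(first_step)))

# An s-curve unfolding from step 3,000 to step 10,000, in a run measured
# from step 100 to step 1,000,000, covers ~13% of log-training steps.
print(log_step_fraction(3_000, 10_000, 100, 1_000_000))  # ~0.13
```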