discrete phases, and the Developmental Landscape paper validates this
Hmm, the phases seem only roughly discrete, and I think a perspective like the multi-component learning perspective totally explains these results, makes stronger predictions, and seems easier to reason about (at least for me).
I would say something like:
The empirical results in the paper indicate that with a tiny (3M-parameter) transformer with learned positional embeddings:
The model initially doesn’t use positional embeddings and doesn’t know common 2-4 grams. So, it probably is basically just learning bigrams to start.
Later, positional embeddings become useful and steadily get more useful over time. At the same time, it learns common 2-4 grams (among other things). (This is now possible as it has positional embeddings.)
Later, the model learns a head which almost entirely attends to the previous token. At the same time as this is happening, ICL score goes down and the model learns heads which do something like induction (alongside a bunch of other stuff it picks up at the same point).
So, I would say the results are “several capabilities of tiny LLMs require other components, so you see phases (aka s-shaped loss curves) based on when these other components come into play”. (Again, see multi-component learning and s-shaped loss curves which makes this exact prediction.)
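To make the multi-component picture concrete, here's a toy sketch (my own illustration with made-up dynamics and constants, not anything from the paper): if capability B can only improve once component A exists, and each improves roughly in proportion to how useful it already is, B gets an s-shaped curve whose onset is set by when A comes online.

```python
import numpy as np

# Toy illustration (not the paper's model): capability B (e.g. learning
# common 2-4 grams) can only make progress once component A (e.g. useful
# positional embeddings) exists. Gating B's learning rate on A's level
# produces an s-shaped curve for B with a delayed onset.

steps = 10_000
a = np.zeros(steps)  # "quality" of component A, in [0, 1]
b = np.zeros(steps)  # "quality" of capability B, in [0, 1]
a[0], b[0] = 1e-3, 1e-3

LR_A, LR_B = 2e-3, 4e-3  # made-up learning rates
for t in range(1, steps):
    # A improves logistically on its own.
    a[t] = a[t-1] + LR_A * a[t-1] * (1 - a[t-1])
    # B's progress is gated by A: no useful positional embeddings,
    # (almost) no n-gram learning.
    b[t] = b[t-1] + LR_B * a[t-1] * b[t-1] * (1 - b[t-1])

for frac in (0.1, 0.5, 0.9):
    print(f"A reaches {frac:.0%} at step {np.argmax(a >= frac)}; "
          f"B at step {np.argmax(b >= frac)}")
```

Stacking another gate on top (e.g. induction-like heads needing a previous-token head) gives you the later phase in the same way.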
My (not confident) impression is a priori people didn’t expect this discrete-phases thing to hold
I mean, it will depend on how a priori you mean. I again think that the perspective in multi-component learning and s-shaped loss curves explains what is going on. This was inspired by various empirical results (e.g. results around an s-shape in induction-like-head formation).
but now I’m leaning towards giving the field time to mature
Seems fine to give the field time to mature. That said, without a theory of change better than “it seems good to generally understand how NN learning works from a theory perspective” (which I’m not yet sold on) or more compelling empirical demos, I don’t find this super compelling. It seems worth some people with high comparative advantage working on this, but it’s not a great pitch. (The current level of relative investment seems maybe a bit high to me, but not crazy. That said, idk.)
Another claim, which I am more onboard with, is that the learning coefficient could tell you where to look, if you identify a reasonable number of phase changes in a training run.
I don’t expect things to localize interestingly for the behaviors we really care about. As in, I expect that the behaviors we care about are learned diffusely across a high fraction of parameters and are learned in a way which either isn’t well described as a phase transition or which involves a huge number of tiny phase transitions of varying size which average out into something messier.
(And getting the details right will be important! I don’t think it will be fine to get 1⁄3 of the effect size if you want to understand things well enough to be useful.)
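As a toy illustration of the “average out into something messier” point (again my own construction, with made-up numbers): summing many small sigmoids with random centers, widths, and sizes in log-steps yields an aggregate curve with essentially no visible phase structure.

```python
import numpy as np

rng = np.random.default_rng(0)
log_steps = np.linspace(0, 5, 2_000)  # log10(step), i.e. steps 1..1e5

# Many tiny transitions: random centers, widths, and sizes (all made up).
n = 500
centers = rng.uniform(0.5, 4.5, n)
widths = rng.uniform(0.02, 0.2, n)
sizes = rng.exponential(1.0, n)
sizes /= sizes.sum()  # normalize total effect to 1

curve = np.zeros_like(log_steps)
for c, w, s in zip(centers, widths, sizes):
    curve += s / (1 + np.exp(-(log_steps - c) / w))

# The aggregate is close to featureless: its steepest local slope is
# barely above the average slope, unlike a single sharp transition
# (a lone sigmoid of width 0.05 would have max/mean slope ~25 here).
slope = np.gradient(curve, log_steps)
print(f"max slope / mean slope: {slope.max() / slope.mean():.2f}")
```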
I think most phase transitions do not occur over a high fraction of training, but instead happen over relatively few SGD steps.
All known phase transitions[1] seem to happen across a reasonably high (>5%?) fraction of log-training steps.[2]
[1] More precisely, “things which seem sort of like phase transitions” (e.g. s-shaped loss curves). I don’t know if these are really phase transitions under some more precise definition.
[2] Putting aside pathological training runs like training a really tiny model (e.g. 3 million params) on 10^20 tokens or something.
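For what it’s worth, one way you could operationalize the “>5% of log-training steps” claim (a sketch under my own assumptions, not an established methodology; the “measurements” below are synthetic stand-ins): fit a logistic in log10(step) to the metric around the transition and report its 10%-90% width as a fraction of the total log-step range.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, lo, hi, center, scale):
    return lo + (hi - lo) / (1 + np.exp(-(x - center) / scale))

# Synthetic stand-in for a metric (e.g. an induction-head score) logged
# over a run of 10^5 steps; in practice you'd load real checkpoint data.
steps = np.logspace(0, 5, 60)
x = np.log10(steps)
true = logistic(x, 0.05, 0.9, center=3.2, scale=0.15)
y = true + np.random.default_rng(0).normal(0, 0.01, x.size)

(lo, hi, center, scale), _ = curve_fit(
    logistic, x, y, p0=[0.0, 1.0, x.mean(), 0.3]
)
# The 10%-90% width of a logistic is 2*ln(9)*scale (in log10-step units).
width = 2 * np.log(9) * scale
print(f"transition width: {width:.2f} log10-steps "
      f"({width / (x.max() - x.min()):.0%} of the log-step range)")
```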