In particular, we’re aiming to understand phase transitions during learning — moments in which models go through sudden, unanticipated shifts. We want to be able to detect these transitions because we think most of the risk is contained in sudden, undesired changes to values or capabilities that occur during training
Maybe a basic question, but why do you think dangerous capabilities will exhibit discrete phase transitions? My impression is that grokking indicates that generalizing subnetworks can be continuously upweighted over a training run, and I don’t know at what point the subnetwork is “implemented” or not, as opposed to “more or less strongly weighted.”
We don’t necessarily expect all dangerous capabilities to exhibit phase transitions. The ones that do are more dangerous because we can’t anticipate them, so this just seems like the most important place to start.
It’s an open question to what extent the lottery-ticket style story of a subnetwork being continually upweighted contradicts (or supports) the phase transition perspective. Just because a subnetwork’s strength is growing constantly doesn’t mean its effect on the overall computation is. Rather than grokking, which is a very specific kind of phase transition, it’s probably better to have in mind the emergence of in-context learning in tandem with induction heads, which seems to us more like the typical case we’re interested in when we speak about structure in neural networks developing across training.
We expect there to be a deeper relation between degeneracy and structure. As an intuition pump, think of a code base where you have two modules communicating across some API. Often, you can change the interface between these two modules without changing the information content being passed between them and without changing their internal structure. Degeneracy — the ways in which you can change your interfaces — tells you something about the structure of these circuits, the boundaries between them, and maybe more. We’ll have more to say about this in the future.
it’s probably better to have in mind the emergence of in-context learning in tandem with induction heads, which seems to us more like the typical case we’re interested in when we speak about structure in neural networks developing across training.
The induction-bump seems like a good test case for the Bayesian basin interpretation.
One would really want to know if the complexity measure can predict ‘emergence’ of capabilities like inner-monologue, particularly if you can spot previously-unknown capabilities emerging which may not be covered in any of your existing benchmarks. But this type of ‘emergence’ tends to happen with such expensive models that the available checkpoints are too separated to be informative (if you get an emergence going from 1b vs 10b vs 100b, what does it mean to compute a complexity measure there? You’d really want to compare them at wherever the emergence actually really happens, like 73.5b vs 74b, or whatever.)
But the induction bump happens at pretty small (ie. cheap) model sizes, so it could be replicated many times and in many ways within-training-run and across training-runs, and one see how the complexity metric reflects or predicts the induction bump. Is that one of the ‘hidden’ transitions you plan to test? And if not, why not?
Our work on the induction bump is now out. We find several additional “hidden” transitions, including one that splits the induction bump in two: a first part where previous-token heads start forming, and a second part where the rest of the induction circuit finishes forming.
The first substage is a type-B transition (loss changing only slightly, complexity decreasing). The second substage is a more typical type-A transition (loss decreasing, complexity increasing). We’re still unclear about how to understand this type-B transition structurally. How is the model simplifying? E.g., is there some link between attention heads composing and the basin broadening?
One would really want to know if the complexity measure can predict ‘emergence’ of capabilities like inner-monologue, particularly if you can spot previously-unknown capabilities emerging which may not be covered in any of your existing benchmarks.
That’s our hope as well. Early ongoing work on toy transformers trained to perform linear regression seems to bear out that lambdahat can reveal transitions where the loss can’t.
But this type of ‘emergence’ tends to happen with such expensive models that the available checkpoints are too separated to be informative (if you get an emergence going from 1b vs 10b vs 100b, what does it mean to compute a complexity measure there? You’d really want to compare them at wherever the emergence actually really happens, like 73.5b vs 74b, or whatever.)
The kind of emergence we’re currently most interested in is emergence over training time, which makes studying these transitions much more tractable (the main cost you’re paying is storage for checkpoints, and storage is cheap). It’s still a hurdle in that we have to start training large models ourselves (or setting up collaborations with other labs).
But the induction bump happens at pretty small (ie. cheap) model sizes, so it could be replicated many times and in many ways within-training-run and across training-runs, and one see how the complexity metric reflects or predicts the induction bump. Is that one of the ‘hidden’ transitions you plan to test? And if not, why not?
The induction bump is one of the main things we’re looking into now.
Maybe a basic question, but why do you think dangerous capabilities will exhibit discrete phase transitions? My impression is that grokking indicates that generalizing subnetworks can be continuously upweighted over a training run, and I don’t know at what point the subnetwork is “implemented” or not, as opposed to “more or less strongly weighted.”
We don’t necessarily expect all dangerous capabilities to exhibit phase transitions. The ones that do are more dangerous because we can’t anticipate them, so this just seems like the most important place to start.
It’s an open question to what extent the lottery-ticket style story of a subnetwork being continually upweighted contradicts (or supports) the phase transition perspective. Just because a subnetwork’s strength is growing constantly doesn’t mean its effect on the overall computation is. Rather than grokking, which is a very specific kind of phase transition, it’s probably better to have in mind the emergence of in-context learning in tandem with induction heads, which seems to us more like the typical case we’re interested in when we speak about structure in neural networks developing across training.
We expect there to be a deeper relation between degeneracy and structure. As an intuition pump, think of a code base where you have two modules communicating across some API. Often, you can change the interface between these two modules without changing the information content being passed between them and without changing their internal structure. Degeneracy — the ways in which you can change your interfaces — tells you something about the structure of these circuits, the boundaries between them, and maybe more. We’ll have more to say about this in the future.
The induction-bump seems like a good test case for the Bayesian basin interpretation.
One would really want to know if the complexity measure can predict ‘emergence’ of capabilities like inner-monologue, particularly if you can spot previously-unknown capabilities emerging which may not be covered in any of your existing benchmarks. But this type of ‘emergence’ tends to happen with such expensive models that the available checkpoints are too separated to be informative (if you get an emergence going from 1b vs 10b vs 100b, what does it mean to compute a complexity measure there? You’d really want to compare them at wherever the emergence actually really happens, like 73.5b vs 74b, or whatever.)
But the induction bump happens at pretty small (ie. cheap) model sizes, so it could be replicated many times and in many ways within-training-run and across training-runs, and one see how the complexity metric reflects or predicts the induction bump. Is that one of the ‘hidden’ transitions you plan to test? And if not, why not?
Our work on the induction bump is now out. We find several additional “hidden” transitions, including one that splits the induction bump in two: a first part where previous-token heads start forming, and a second part where the rest of the induction circuit finishes forming.
The first substage is a type-B transition (loss changing only slightly, complexity decreasing). The second substage is a more typical type-A transition (loss decreasing, complexity increasing). We’re still unclear about how to understand this type-B transition structurally. How is the model simplifying? E.g., is there some link between attention heads composing and the basin broadening?
That’s our hope as well. Early ongoing work on toy transformers trained to perform linear regression seems to bear out that lambdahat can reveal transitions where the loss can’t.
The kind of emergence we’re currently most interested in is emergence over training time, which makes studying these transitions much more tractable (the main cost you’re paying is storage for checkpoints, and storage is cheap). It’s still a hurdle in that we have to start training large models ourselves (or setting up collaborations with other labs).
The induction bump is one of the main things we’re looking into now.
Looks like it’s in-progress.