…phase transitions organize learning and that detecting, locating, and understanding these transitions…
Only the third ability sounds like it could be useful for alignment. The others seem like they’d be firing all the time, and if you only had that and, for instance, the ability to understand the semantics of a single neuron, then I don’t think that gets you very far. Or am I missing something obvious?
I think it is too early to know how many phase transitions there are in e.g. the training of a large language model. If there are many, it seems likely to me that they fall along a spectrum of “scale” and that it will be easier to find the more significant ones than the less significant ones (e.g. we discover transitions like the onset of in-context learning first, because they dramatically change how the whole network computes).
As evidence for that view, I would put forward the fact that putting features into superposition is known to be a phase transition in toy models (based on the original post by Elhage et al and also our work in Chen et al) and therefore seems likely to be a phase transition in larger models as well. That gives an example of phase transitions at the “small” end of the scale. At the “big” end of the scale, the evidence in Olsson et al that induction heads and in-context learning appear in a phase transition seems convincing to me.
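To make the “small” end concrete, here is a minimal PyTorch sketch of the kind of toy setup where the superposition transition shows up: reconstruct sparse features through a low-dimensional bottleneck with a ReLU. The general form follows the toy-models framing, but the hyperparameters, the sparsity sweep, and the norm threshold below are illustrative choices of mine, not taken from Elhage et al or Chen et al.

```python
# Minimal sketch of a toy model of superposition, assuming a setup of the
# general form y = ReLU(W^T W x + b) trained to reconstruct sparse features x.
# All hyperparameters and thresholds are illustrative.
import torch

n_features, n_hidden = 20, 5

def train_toy_model(sparsity, steps=3000, batch=1024, lr=1e-3):
    W = torch.randn(n_hidden, n_features, requires_grad=True)
    b = torch.zeros(n_features, requires_grad=True)
    opt = torch.optim.Adam([W, b], lr=lr)
    for _ in range(steps):
        # Each feature is active with probability (1 - sparsity), value in [0, 1].
        x = torch.rand(batch, n_features)
        x = x * (torch.rand(batch, n_features) < (1 - sparsity)).float()
        y = torch.relu(x @ W.T @ W + b)
        loss = ((y - x) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return W.detach()

# Sweep sparsity: at low sparsity only roughly n_hidden features get dedicated
# directions; at high sparsity the model starts packing extra features into
# superposition, visible as a jump in how many columns of W have large norm.
for s in [0.0, 0.5, 0.9, 0.99]:
    W = train_toy_model(s)
    represented = (W.norm(dim=0) > 0.5).sum().item()
    print(f"sparsity={s}: features with column norm > 0.5 -> {represented}")
```

The point is just that in the toy setting the observable is cheap (column norms of W) and the change as you sweep sparsity is sharp, which is what makes it a clean example of a transition at the small end.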
On general principles, understanding “small” phase transitions (where the scale is judged relative to the overall size of the system, e.g. number of parameters) is like probing a physical system at small length scales / high energy, and will require more sophisticated tools. So I expect that we’ll start by gaining a good understanding of “big” phase transitions and then, as the experimental methodology and theory improve, move down the spectrum towards smaller transitions.
On these grounds I don’t expect us to be swamped by the smaller transitions, because they’re just hard to see in the first place; the major open problem in my mind is how far we can get down the scale with reasonable amounts of compute. Maybe one way that SLT & developmental interpretability fails to be useful for alignment is if there is a large “gap” in the spectrum, where beyond the “big” phase transitions that are easy to see (and for which you may not need fancy new ideas) there is just a desert / lack of transitions, and all the transitions that matter for alignment are “small” enough that a lot of compute and/or very sophisticated ideas are necessary to study them. We’ll see!
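For the “easy to see” end, here is a rough sketch of how one might locate a big transition like the onset of in-context learning across saved checkpoints, using an observable in the spirit of Olsson et al’s in-context learning score (loss late in the context minus loss early in the context). The per-token losses below are simulated placeholders, and flagging the largest jump between consecutive checkpoints is just one simple heuristic, not a claim about the right way to do this.

```python
# Rough sketch of locating a "big" transition from saved checkpoints, assuming
# we have per-token validation losses for each checkpoint. The data is simulated.
import numpy as np

rng = np.random.default_rng(0)
n_checkpoints, seq_len = 50, 512

# Simulated per-token losses: flat early in training, then late-context tokens
# get cheaper from checkpoint 30 onward, mimicking the onset of in-context learning.
per_token_loss = 3.0 + 0.05 * rng.standard_normal((n_checkpoints, seq_len))
per_token_loss[30:, 256:] -= 0.8

def icl_score(losses, early=50, late=500):
    """In-context learning score for one checkpoint: late-token loss minus early-token loss."""
    return losses[late] - losses[early]

scores = np.array([icl_score(l) for l in per_token_loss])

# "Locate" the transition as the checkpoint with the largest jump in the score.
jumps = np.abs(np.diff(scores))
transition = int(np.argmax(jumps)) + 1
print(f"largest change in in-context learning score at checkpoint {transition}")
```

The open question for smaller transitions is what plays the role of this score: a transition that only reorganizes a small part of the network will barely move a global observable like this, which is part of why I expect more compute and sharper tools to be needed further down the spectrum.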
Thank you, that was helpful. If I’m getting this right, you think the “big” transitions plausibly correspond to important capability gains. So under that theory, “chain of thought” and “reflection” arose due to big phase transitions in GPT-3 and GPT-4. I think it’d be great if researchers could access training checkpoints of these models, or at least make bids for experiments to be performed on them.
That’s what we’re thinking, yeah.