It sounds like your case for SLT that you make here is basically “it seems heuristically good to generally understand more stuff about how SGD works”. This seems like a reasonable case, though considerably weaker than many other more direct theories of change IMO.
I think you might buy the high level argument for the role of generalisation in alignment, and understand that SLT says things about generalisation, but wonder if that ever cashes out in something useful.
This is a reasonably good description of my view.
It seems fine if the pitch is “we’ll argue for why this is useful later, trust that we have good ideas in mind on the basis of other aspects of our track record”. (This combined with the general “it seems heuristically good to understand stuff better in general” theory of change is enough to motivate some people working on this IMO.)
To judge that empirical work by the standard of other empirical work divorced from a deeper set of claims, i.e. purely by “the stuff that it finds”, is to miss the point.
To be clear, my view isn’t that this empirical work doesn’t demonstrate something interesting. (I agree that it helps to demonstrate that SLT has grounding in reality.) My claim was just that it doesn’t demonstrate that SLT is useful. And that would require additional hopes (which don’t yet seem well articulated or plausible to me).
When I said “I find the examples of empirical work you give uncompelling because they were all cases where we could have answered all the relevant questions using empirics and they aren’t analogous to a case where we can’t just check empirically.”, I was responding to the fact that the corresponding section in the original post starts with “How useful is this in practice, really?”. This work doesn’t demonstrate usefulness, it demonstrates that the theory makes some non-trivial correct predictions.
(That said, the predictions in the small transformer case are about easy-to-determine properties that show up on basically any test of “is something large changing in the network”, AFAICT. Maybe some of the other papers make more subtle predictions?)
(I have edited my original comment to make this distinction more clear, given that this distinction is important and might be confusing.)
In terms of more subtle predictions: in the Berkeley Primer in mid-2023, based on elementary manipulations of the free energy formula, I predicted that we should see phase transitions / developmental stages where the loss stays relatively constant but the LLC (model complexity) decreases.
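For concreteness, the elementary manipulation I have in mind can be sketched like this (my notation here, reading off the standard asymptotic free energy expansion rather than quoting the Primer verbatim):

```latex
% Asymptotic free energy of a phase with loss L and LLC \lambda,
% at sample size n:
F_n \approx n L + \lambda \log n
% Comparing two phases, phase 2 is preferred once
n L_2 + \lambda_2 \log n < n L_1 + \lambda_1 \log n ,
% so when the losses are roughly equal (L_1 \approx L_2) this
% reduces to
\lambda_2 < \lambda_1 ,
% i.e. a transition can be driven purely by a decrease in model
% complexity, with the loss curve staying roughly flat.
```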
We noticed one such stage in the language models, and two in the linear regression transformers in the developmental landscape paper. We only partially understood them there, but we’ve seen more behaviour like this in the upcoming work I mentioned in my other post, and we feel more comfortable now linking it to phenomena like “pruning” in developmental neuroscience. This suggests some interesting connections with loss of plasticity (i.e. we see many components have LLC curves that go up, then come down, and one would predict that after this decrease the components are more resistant to being changed by further training).
These are potentially consequential changes in model computation that are (in these examples) arguably not visible in the loss curve, and it’s not obvious to me how you could be confident of noticing them from other metrics you would have thought to track. In each case they might correspond with something, say the magnitude of layer norm weights, but out of the thousands of things you could measure, it’s unclear to me why you would a priori associate any one such signal with a change in model computation unless you knew it was linked to the LLC curve. Things like the FIM trace or Hessian trace might also reflect the change; however, in the second such stage in the linear regression transformer (LR4) this seems not to be the case.
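For readers unfamiliar with how the LLC is measured in practice: a minimal sketch of the kind of estimator involved (SGLD sampling of a tempered posterior combined with a WBIC-style formula). The toy quadratic loss and all hyperparameters below are illustrative choices of mine, not taken from any of the papers; for a regular (non-singular) quadratic loss in d dimensions the true learning coefficient is d/2, which gives a sanity check.

```python
import numpy as np

# Sketch: WBIC-style LLC estimator via SGLD, on a toy 2-D quadratic
# "population loss" L(w) = ||w||^2 / 2. This is the regular case,
# whose true learning coefficient is d/2 = 1. Singular losses are the
# interesting case in SLT, but the estimator has the same shape.
rng = np.random.default_rng(0)

n = 1000                  # nominal sample size
beta = 1.0 / np.log(n)    # inverse temperature beta* = 1 / log n
eps = 1e-4                # SGLD step size (illustrative)
steps, burn_in = 50_000, 5_000

def loss(w):
    return 0.5 * np.dot(w, w)

def grad_loss(w):
    return w

w = np.zeros(2)           # start at the minimiser w*, where L(w*) = 0
samples = []
for t in range(steps):
    # Langevin step targeting p(w) proportional to exp(-n * beta * L(w))
    w = (w - eps * n * beta * grad_loss(w)
         + np.sqrt(2 * eps) * rng.standard_normal(2))
    if t >= burn_in:
        samples.append(loss(w))

# LLC estimate: lambda_hat = n * beta * (E_beta[L] - L(w*))
lambda_hat = n * beta * (np.mean(samples) - loss(np.zeros(2)))
print(f"estimated LLC: {lambda_hat:.2f} (true value for this loss: 1.0)")
```

Tracked across training checkpoints of a real model (with the loss replaced by the network’s empirical loss and w* by the current parameters), a curve of such estimates is the sort of signal the stage analysis above relies on.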