Nice work! It’s great to have empirical tests of how well one can approximate the learning coefficient in practice, and of how the coefficient corresponds to high-level properties of the model. This post is perhaps the best illustration of SLT’s applicability to practical problems that I know of, so thank you.
Question about the phase transitions: I don’t quite see the connection between the learning coefficient and phase transitions. You write:
In particular, these unstable measurements “notice” the grokking transition between memorization and generalization when training loss stabilizes and test loss goes down. (As our networks are quite efficient, this happens relatively early in training.)
For the first 5 checkpoints the test loss goes up, after which it goes (sharply) down. However, looking at the learning coefficient over the first 5 to 10 checkpoints, I can’t really pinpoint “ah, that’s where the model starts to generalize”. Sure, the learning coefficient starts to rise more sharply, but couldn’t that simply be explained by the training loss going down?