I'm reading In-context Learning and Induction Heads (transformer-circuits.pub). The paper argues that this already strongly suggests some connection between induction heads and in-context learning, and that, beyond that, this window appears to be a pivotal point for the training process in general: whatever is occurring is visible as a bump on the training curve (figure below). It is in fact the only place in training where the loss curve is not convex (everywhere else the loss falls at a monotonically slowing rate).
I can see the bump, but it's not the only one. The two-layer model's curve has a second, similar bump, which also appears in the one-layer model, and I think I can see it very faintly in the three-layer model as well. Did they ignore the second bump because it only exists in small models, while their bump persists in bigger ones?
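To make "not convex" concrete, here's a minimal sketch of how one might locate such bumps numerically. It assumes you have a 1-D numpy array of loss values sampled at regular training-step intervals; the `loss` argument, the smoothing `window`, and the `tol` threshold are all my own stand-ins, not anything from the paper.

```python
import numpy as np

def find_bumps(loss, window=51, tol=0.0):
    """Flag spans where a (smoothed) training-loss curve stops being convex.

    `loss` is assumed to be a 1-D array of loss values sampled at regular
    training-step intervals -- a stand-in for the curves in the paper's figure.
    Returned spans are indices into the smoothed curve, so they are offset
    from raw training steps by roughly window // 2.
    """
    # Smooth with a simple moving average so step-to-step noise doesn't
    # register as spurious non-convexity.
    kernel = np.ones(window) / window
    smooth = np.convolve(loss, kernel, mode="valid")

    slope = np.diff(smooth)        # first difference: rate of change
    curvature = np.diff(slope)     # second difference: convexity check

    # Convexity means the slope is non-decreasing (curvature >= 0), i.e. the
    # loss falls at a slowing rate. Indices where curvature dips below -tol
    # are candidate "bump" regions.
    nonconvex = np.flatnonzero(curvature < -tol)

    # Group consecutive indices into contiguous spans.
    spans = []
    if nonconvex.size:
        breaks = np.flatnonzero(np.diff(nonconvex) > 1)
        for chunk in np.split(nonconvex, breaks + 1):
            spans.append((int(chunk[0]), int(chunk[-1])))
    return spans
```

On a noisy curve like the two-layer model's, a detector like this would presumably surface more than one span, and which ones count as real bumps depends heavily on the smoothing window and tolerance, which is exactly why it matters whether the paper's criterion picks out one bump or two.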