The smooth graphs seem like good evidence that there are much smoother underlying changes in the model, and that the abruptness of the change is about behavior or evaluation rather than what gradient descent is learning.
That does not seem true to me, and seems like just as much of a leap as the OP’s. A priori, if I see a smooth curve in one metric and a discontinuous or abrupt change in another, I do not see how that should make me more confident that it is ‘about behavior or evaluation’. Why should I conclude that? Why can’t it reflect a non-smooth underlying change in the model first? I would only conclude that if I had already ruled out internal changes because I was already committed to the position that NNs can only learn and change internally in smooth, small ways… which unfortunately we already know is a false position, because of things like Anthropic’s induction bump, which show phase transitions in the internals of the model that are nearly invisible in the loss. (And incidentally, because the bump is so small and the training curve still so smooth, it also falsifies the more modest claim that small changes in perplexity must reflect small changes in the model internals—maybe small changes usually do not reflect non-smooth underlying changes, but it is entirely possible and does happen, and we would surely find many more routine examples if we had better interpretability, so that examining a single instance didn’t take man-years.) And also a priori, from the old statistical-mechanics literature, you should expect abrupt phase changes of various sorts in NN models (which may or may not be visible in the training curve), as in parity models, where the task is so simple and clearly defined that the abruptness cannot have anything to do with the ‘behavior’ or ‘evaluation’ being wrong, and instead comes from effects like symmetry-breaking (often associated with plateaus and flat curves...).
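To make the parity example concrete, here is a minimal sketch of the kind of sparse-parity setup usually studied. The architecture and hyperparameters are illustrative assumptions rather than taken from any particular paper, and whether (and when) the accuracy jump shows up depends heavily on width, learning rate, and batch size:

```python
# Minimal sparse-parity sketch. Assumptions (for illustration only): 40 input
# bits in {-1,+1}, label = parity (product) of the first 3 bits, a small MLP
# trained with plain SGD on fresh samples. In typical runs of setups like this,
# accuracy hovers near chance for a long stretch and then climbs quickly, even
# though the task and the evaluation never change.
import torch
import torch.nn.functional as F

n, k, batch = 40, 3, 1024
model = torch.nn.Sequential(
    torch.nn.Linear(n, 128), torch.nn.ReLU(), torch.nn.Linear(128, 1)
)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(20_000):
    x = torch.randint(0, 2, (batch, n)).float() * 2 - 1  # fresh +/-1 inputs each step
    y = x[:, :k].prod(dim=1)                              # parity of the first k bits
    out = model(x).squeeze(-1)
    loss = F.soft_margin_loss(out, y)                     # logistic loss on +/-1 labels
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 1000 == 0:
        acc = ((out > 0) == (y > 0)).float().mean().item()
        print(f"step {step:6d}  loss {loss.item():.3f}  acc {acc:.3f}")
```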
If perplexity on a task is gradually decreasing, then I think that’s probably produced by some underlying gradual change in the model (which may be the sum of a ton of tiny discrete changes).
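As a toy illustration of ‘the sum of a ton of tiny discrete changes’ (a constructed example, not anything measured on a real model): give each of many tiny components its own step-function improvement at a random point in training, and the aggregate loss curve comes out looking gradual.

```python
# Toy: many tiny discrete improvements, each a small step at a random time,
# add up to a loss curve that decreases gradually at the aggregate level.
import numpy as np

rng = np.random.default_rng(0)
n_steps, n_components = 10_000, 5_000
onset = rng.integers(0, n_steps, n_components)             # when each tiny change lands
size = rng.exponential(1.0 / n_components, n_components)   # how much each one helps

per_step_drop = np.zeros(n_steps)
np.add.at(per_step_drop, onset, size)      # pile each step-change onto its onset step
loss = 1.0 - np.cumsum(per_step_drop)      # individually discrete, collectively smooth

print(np.round(loss[::1000], 3))
```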
If accuracy and log loss are both improving, I think that’s most likely due to the same underlying phenomenon. That’s not nearly as obvious—it could be that there are two separate phenomena, one giving rise to gradual improvements in perplexity without affecting accuracy while the other gives rise to abrupt improvements in accuracy without being reflected in perplexity—but it still seems like a very natural guess.
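As a toy illustration of how a single smoothly improving quantity could drive both curves at once (again a constructed example, not a claim about what is happening on the tasks under discussion): if the per-token probability of the correct answer rises smoothly, the summed log loss improves gradually the whole way, while exact-match accuracy on a multi-token answer, being roughly the product of the per-token probabilities, sits near zero for a long time and then rises sharply.

```python
# Toy: one smooth underlying quantity (per-token probability of the correct
# token) produces a gradual log-loss curve and an abrupt-looking accuracy curve
# at the same time. The answer length k is an arbitrary illustrative choice.
import numpy as np

p_token = np.linspace(0.05, 0.99, 50)   # smoothly improving per-token probability
k = 10                                   # tokens in the answer

log_loss = -k * np.log(p_token)          # summed log loss: improves gradually
exact_match = p_token ** k               # chance all k tokens are right: jumps late

for p, ll, em in zip(p_token[::7], log_loss[::7], exact_match[::7]):
    print(f"p_token={p:.2f}  log_loss={ll:6.2f}  exact_match={em:.4f}")
```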
The induction bump in particular seems to involve accuracy and log loss improving together, unsurprisingly.
Of course the induction behavior is just one small driver of log loss, and so it corresponds to a small blip on the overall loss or accuracy curves while corresponding to a big jump on some subtasks. In a larger model there are likely to be many events like this that don’t correspond to any blip at all in the overall loss curve while being important for a subtask. This seems unlikely to be the driver of the difference for the BIG-bench tasks under discussion, since the continuous log probability improvements and the discontinuous accuracy improvements are being measured on the same distribution.
In the case of parities, I think there is a smooth underlying change in the model, e.g. see figure 3 in this paper. I agree that (i) such changes are not always visible in perplexity, e.g. for parities, and therefore it’s not obvious that you will know where to look for them even if they exist; and (ii) it’s not obvious whether they always exist, since we just know about a few cases we’ve studied, like parities and grokking.