The smooth graphs seem like good evidence that there are much smoother underlying changes in the model, and that the abruptness of the change is about behavior or evaluation rather than what gradient descent is learning. (Even on these relatively narrow tasks, which are themselves much more abrupt than averages across many sub-tasks.) That’s useful if your forecasts are based on trend extrapolation, and suggests that if you want to make forecasts you should be looking at those smoother underlying changes prior to the model performing well on the task.
Predicting where a jump would occur would depend (at least) on details about the evaluation, and on other facts about the distribution of the model’s behavior. Things like: what is the definition of success for the task, how large are the model outputs, how large is the variance in logits. Prima facie if you have continuously increasing bias towards the right answer, you’ll see significant increases in accuracy as the bias becomes large relative to the noise. If your evaluation is the conjunction of multiple steps, you’ll see a rapid increase around the point when your per-step accuracy is high enough to make it through a whole sequence successfully with significant probability. And so on.
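As a rough illustration of both mechanisms, here is a toy simulation (not drawn from any real model; the noise scale, sequence length, and bias schedule are made up for illustration): a smoothly growing bias toward the correct token produces a smooth rise in per-step accuracy, while scoring success as the conjunction of many steps turns that into a sharp jump in sequence-level accuracy.

```python
# Toy illustration (not tied to any real model) of how smooth underlying
# improvements can look abrupt under a discrete success metric.
import numpy as np

rng = np.random.default_rng(0)
biases = np.linspace(0.0, 3.0, 31)   # stand-in for a smoothly improving model
noise_sd = 1.0                        # fixed noise in the logit gap
seq_len = 20                          # evaluation requires 20 correct steps

for bias in biases:
    # Mechanism 1: per-step accuracy when a smoothly growing bias toward the
    # correct answer competes with fixed Gaussian noise in the logit gap.
    logit_gap = bias + noise_sd * rng.standard_normal(100_000)
    per_step_acc = (logit_gap > 0).mean()

    # Mechanism 2: the evaluation is a conjunction of seq_len steps, so the
    # sequence-level success rate is roughly per_step_acc ** seq_len, which
    # stays near zero for a long time and then rises sharply.
    seq_acc = per_step_acc ** seq_len
    print(f"bias={bias:4.1f}  per-step acc={per_step_acc:.3f}  seq acc={seq_acc:.3f}")
```

The per-step curve is a smooth sigmoid in the bias, while the sequence-level curve looks like a sudden "emergence" even though nothing discontinuous happened underneath.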
If one wanted to move from “evidence that we may be able to predict” to “evidence that we can currently predict,” then I agree that you should actually do the experiments where you dig in on those empirics and see how good the estimate is. And clearly the OP is less useful (and much less effort!) than a post that actually carried out that kind of careful empirical investigation.
But the basic point seems important, and the high-level take seems more accurate to me than “the presence of abrupt capability jumps suggests that we may not be able to predict future capability changes” (e.g. I think someone would have a less accurate view of the situation if they read the previous Anthropic paper on this topic than if they read this post). The evidence for the latter claim is of a very similar speculative nature; it’s just quite hard to talk about predictability either way without actually trying to make predictions.
A nice thing about this setting is that it would in fact be relatively easy to make retrodictions (or even predictions about upcoming models). That is, someone can be blinded to the performance of large models on a given task and try to predict it from observations of smaller models, i.e. by looking at the definition of success on the task, the perplexity and logit variance of smaller models, how those vary across different task instances, etc.
I’m willing to qualitatively predict “yes they could predict it.” From your comments it sounds like you disagree, but obviously we’d have to make it quantitative to have a bet. If we do have a disagreement I’m happy to try to make it more precise in the hopes that doing so may encourage someone to run this experiment and make it clearer how to interpret the results.
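To make that concrete, here is a minimal sketch of the blinded retrodiction exercise described above (the data layout, the linear-in-log-size fit, and the exact-match proxy are illustrative assumptions, not a claim about how such a study should actually be run):

```python
# Hedged sketch: predict a large model's task accuracy from smaller models'
# per-instance log-probabilities, without ever looking at the large model.
import numpy as np

# logprobs[model_size] -> array of shape (n_instances,) holding the total
# log-probability each smaller model assigns to the correct answer sequence.
def predict_large_model_accuracy(logprobs: dict[float, np.ndarray],
                                 target_size: float) -> float:
    sizes = np.array(sorted(logprobs))
    per_instance = np.stack([logprobs[s] for s in sizes])  # (n_models, n_instances)

    n_instances = per_instance.shape[1]
    predicted_correct = 0.0
    for i in range(n_instances):
        # Fit each instance's log-prob as a linear trend in log(model size)
        # and extrapolate to the held-out large model.
        slope, intercept = np.polyfit(np.log(sizes), per_instance[:, i], deg=1)
        predicted_logprob = slope * np.log(target_size) + intercept
        # exp(total log-prob) is the chance of sampling the exact answer,
        # used here as a crude proxy for exact-match accuracy.
        predicted_correct += min(1.0, float(np.exp(predicted_logprob)))
    return predicted_correct / n_instances
```

The interesting question is then how far off this kind of per-instance extrapolation is from the large model's measured accuracy, and whether richer features (logit variance, task structure) close the gap.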
The smooth graphs seem like good evidence that there are much smoother underlying changes in the model, and that the abruptness of the change is about behavior or evaluation rather than what gradient descent is learning.
That does not seem true to me, and it seems like as much of a leap as the OP. A priori, if I see a smooth curve in one metric and a discontinuous or abrupt change in another, I do not see how that should make me more confident that it is ‘about behavior or evaluation’. Why should I conclude that? Why can’t it reflect a non-smooth underlying change in the model first? I would only conclude that if I had already ruled out internal changes because I was already committed to the position that NNs can only learn and change internally in smooth, small ways… which unfortunately we already know is a false position, because of things like Anthropic’s induction bump, which shows a phase transition in the internals of the model that is nearly invisible in the loss. (And also, incidentally, because the bump is so small and the training curve still so smooth, it falsifies the more modest claim that small changes in perplexity must reflect small changes in the model internals: maybe small changes usually do not reflect non-smooth underlying changes, but nevertheless, it is entirely possible and does happen, and we would surely find many more routine examples if we had better interpretability, so that examining a single instance didn’t take man-years.) And also a priori, from the old statistical mechanics literature, you should expect abrupt phase changes of various sorts in NN models (which may or may not be visible in the training curve), like parity models, where the task is so simple and clearly defined that it cannot have anything to do with the ‘behavior’ or ‘evaluation’ being wrong, and where the jump comes from effects like symmetry-breaking (often associated with plateaus and flat curves...).
If perplexity on a task is gradually decreasing then I think that’s probably produced by some underlying gradual change in the model (which may be the sum of a ton of tiny discrete changes).
If accuracy and log loss are both improving, I think that’s most likely due to the same underlying phenomenon. That’s not nearly as obvious (it could be that there are two separate phenomena, one giving rise to gradual improvements in perplexity without affecting accuracy while the other gives rise to abrupt improvements in accuracy without being reflected in perplexity), but it still seems like a very natural guess.
The induction bump in particular seems to involve accuracy and log loss improving together, unsurprisingly.
Of course the induction behavior is just one small driver of log loss and so it corresponds to a small blip on the loss or accuracy curves overall, while corresponding to a big jump on some subtasks. In a larger model there are likely to be many events like this that don’t correspond to any blip at all in the overall loss curve while being important for a subtask. This seems unlikely to be the driver of the difference for the BIG-Bench tasks under discussion, since the continuous log probability improvements and discontinuous accuracy improvements are being measured on the same distribution.
In the case of parities, I think there is a smooth underlying change in the model, e.g. see figure 3 in this paper. I agree that (i) such changes are not always visible in perplexity, e.g. for parities, and therefore it’s not obvious that you will know where to look for them even if they exist, and (ii) it’s not obvious whether they always exist; we just know about a few cases we’ve studied, like parities and grokking.
The smooth graphs seem like good evidence that there are much smoother underlying changes in the model, and that the abruptness of the change is about behavior or evaluation rather than what gradient descent is learning.
If we’re trying to predict abrupt changes in the accuracy of output token sequences, the per-token log-likelihood can be a useful signal. What’s the analogous signal when we’re talking about abrupt changes in a model’s ability to deceptively conceal capabilities, hack GPU firmware, etc.? What log-likelihood plots can we use to predict those types of abrupt changes in behavior?
Here, I think we’ll want to look for suspicious changes in the log-likelihood trends. E.g., it’s a red flag if we see steady increases in log-likelihood on some scary behavior, but then the trend reverses at some level of model scale.
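As a minimal sketch of what that check could look like in practice (the function, the threshold, and the example numbers are illustrative assumptions, not an established method):

```python
# Flag the kind of suspicious trend reversal described above: log-likelihood
# on some worrying behavior rises with scale and then falls back down.
import numpy as np

def flags_reversal(scales: np.ndarray, logliks: np.ndarray,
                   min_drop: float = 0.1) -> bool:
    """Return True if log-likelihood peaks at an intermediate scale and then
    drops by more than `min_drop` nats by the largest scale evaluated."""
    order = np.argsort(scales)
    logliks = logliks[order]
    peak = int(np.argmax(logliks))
    return peak < len(logliks) - 1 and logliks[peak] - logliks[-1] > min_drop

# Example: a steady rise followed by a fall at the largest model, which would
# be a red flag if the metric were likelihood assigned to, say, deceptive text.
sizes = np.array([1e8, 3e8, 1e9, 3e9, 1e10])
lls = np.array([-5.2, -4.6, -4.1, -3.9, -4.8])
print(flags_reversal(sizes, lls))  # True
```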