One possible problem is that while we might expect log(P(scary coherent behavior)) to go up in general as we scale models, this doesn’t mean that log(P(scary coherent behavior)) - log(P(coherent behavior)) goes up—it could simply be that the models are getting better at being coherent in general. In some cases, it could even be that the model becomes less overconfident!
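To spell out why the difference is the quantity that matters (this is just a rearrangement, under the assumption that scary coherent behaviors are a subset of coherent behaviors, not anything taken from the papers cited below):

$$\log P(\text{scary coherent}) - \log P(\text{coherent}) = \log \frac{P(\text{scary coherent})}{P(\text{coherent})} = \log P(\text{scary} \mid \text{coherent})$$

So the difference tracks how much of the model's coherent behavior is scary, rather than how coherent the model is overall.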
For example, in Figure 6 of the Wei et al. emergence paper, the log probabilities assigned to the correct multiple-choice answers and to the incorrect answers both go up slowly, until they diverge at a bit over 10^22 FLOPs.
The authors explain:
The reason is that larger models produce less-extreme probabilities (i.e., values approaching 0 or 1) and therefore the average log-probabilities have fewer extremely small values.
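For concreteness, here is a minimal sketch of how the metric in that figure, the average log-probability a model assigns to each multiple-choice option, might be computed. It uses a small Hugging Face causal LM as a stand-in (the paper's models are far larger), and the prompt and options are made up purely for illustration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small stand-in model; the paper evaluates far larger ones.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def answer_logprob(prompt: str, answer: str) -> float:
    """Total log-probability the model assigns to the answer tokens, given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    answer_ids = tokenizer(answer, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    offset = prompt_ids.shape[1]
    total = 0.0
    for i in range(answer_ids.shape[1]):
        # Logits at position t predict the token at position t + 1.
        total += log_probs[0, offset + i - 1, answer_ids[0, i]].item()
    return total

# Toy multiple-choice item, purely for illustration.
prompt = "Q: What is the capital of France?\nA:"
options = {" Paris": True, " Lyon": False, " Berlin": False}
for option, is_correct in options.items():
    label = "correct" if is_correct else "incorrect"
    print(f"{label}: {option!r} -> log P = {answer_logprob(prompt, option):.2f}")
```

Averaging this over many items, separately for correct and incorrect options and at each model scale, would give curves like the ones in the figure.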
Also, the same figure suggests that log(P(behavior)) trends don't always continue forever (log(P(incorrect)) certainly doesn't), so I'd caution against reading too much into log-likelihood/cross-entropy loss alone.
That being said, I still think we should try to build a better understanding of the smooth underlying changes, as well as a theory of the “critical thresholds”. As a start, someone should probably try to retrodict either when model capabilities emerge, given the log-likelihoods mentioned in this post, or when grokking occurs, using the metrics given in Neel Nanda’s modular arithmetic post.
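As a very rough sketch of what that retrodiction exercise might look like (my framing, not something from the post or the papers: it assumes the smooth metric is roughly linear in log-compute, and that emergence corresponds to the trend crossing some threshold):

```python
import numpy as np

def retrodict_emergence(log10_flops: np.ndarray,
                        metric: np.ndarray,
                        threshold: float) -> float:
    """Fit metric ~= a * log10(FLOPs) + b and return the log10(FLOPs)
    at which the fitted trend crosses `threshold`."""
    a, b = np.polyfit(log10_flops, metric, deg=1)
    return (threshold - b) / a

# Illustrative placeholder values only, not measurements from any paper.
# The metric could be e.g. mean log P(correct option) - mean log P(incorrect option).
log10_flops = np.array([20.0, 21.0, 22.0, 23.0])
metric = np.array([-0.05, 0.00, 0.10, 0.40])
print(retrodict_emergence(log10_flops, metric, threshold=0.2))
```

The interesting check would be whether the predicted crossing point lines up with where the downstream accuracy actually jumps (or, for grokking, with the training step at which test accuracy takes off in Neel Nanda's setup).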