I think the more important point is that when deviating from Chinchilla optimality, measuring effective compute becomes misleading: you can span larger increases in effective compute between Chinchilla optimal models by taking a detour through overtrained models. And given the price difference, Claude 3.5 Sonnet is likely more overtrained than Claude 3 Opus.
Let’s say we start with a Chinchilla optimal model with N active parameters that trains for 20N tokens, using 120N² FLOPs of compute (by the C ≈ 6ND approximation). We can then train another model with N/3 active parameters for 180N tokens using 360N² FLOPs of compute, and get approximately the same performance as with the previous model, but we’ve now used 3 times more compute, below the RSP’s 4x threshold. Then we train the next Chinchilla optimal model with 3N active parameters for 60N tokens using 1080N² FLOPs of compute, an increase by another factor of 3, also below the 4x threshold. But only this second step to the new Chinchilla optimal model increases capabilities, and it uses 9x more compute than the previous Chinchilla optimal model.
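To make the arithmetic explicit, here is a minimal sketch of the three training runs, assuming the standard C ≈ 6ND FLOPs approximation; the value of N is arbitrary, since only the ratios matter.

```python
# Sketch of the three training runs above, using the C ≈ 6 * N * D FLOPs
# approximation. N is an arbitrary illustrative value; only ratios matter.
N = 1e9  # active parameters

runs = {
    "chinchilla_optimal_1": (N,     20 * N),   # N params, 20N tokens   -> 120 N^2 FLOPs
    "overtrained":          (N / 3, 180 * N),  # N/3 params, 180N tokens -> 360 N^2 FLOPs
    "chinchilla_optimal_2": (3 * N, 60 * N),   # 3N params, 60N tokens  -> 1080 N^2 FLOPs
}

flops = {name: 6 * n * d for name, (n, d) in runs.items()}

# Each consecutive step is only a 3x increase, below a 4x threshold...
print(flops["overtrained"] / flops["chinchilla_optimal_1"])          # 3.0
print(flops["chinchilla_optimal_2"] / flops["overtrained"])          # 3.0
# ...but the jump between the two Chinchilla optimal models is 9x.
print(flops["chinchilla_optimal_2"] / flops["chinchilla_optimal_1"]) # 9.0
```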
It looks like the example you gave is pretty explicitly using “compute” rather than “effective compute”. The point of the “effective” part is to take into account non-compute progress, such as using more optimal N/D ratios. I think in your example, the first two models would be at the same effective compute level, based on us predicting the same performance.
That said, I haven’t seen any detailed descriptions of how Anthropic is actually measuring/calculating effective compute (iirc they link to a couple papers and the main theme is that you can use training CE loss as a predictor).
I think in your example, the first two models would be at the same effective compute level, based on us predicting the same performance.
This is a reasonable formulation of what “effective compute” could be defined to mean, but is it actually used in this sense in practice, and who uses it like that? Is it plausible that this sense was in play when Anthropic made the claim that “While Claude 3.5 Sonnet represents an improvement in capabilities over our previously released Opus model, it does not trigger the 4x effective compute threshold”, a claim that compares a more Chinchilla optimal model to a more overtrained model?
It’s an interesting thought; I didn’t consider that this sense of “effective compute” could be the intended meaning. I was thinking more of a compute multiplier measured from perplexity/FLOPs plots of compute optimal training runs that compare architectures, like in Figure 4 of the Mamba paper. There we can see that Transformer++ (RMSNorm/SwiGLU/etc.) needs about 5 times less compute (2 times less data) than vanilla Transformer to get the same perplexity, so you just multiply physical compute by 5 to find the effective compute of Transformer++ with respect to vanilla Transformer. (With this sense of “effective compute”, my argument in the grandparent comment remains the same for effective compute as it is for physical compute.)
In particular, this multiplication still makes sense as a way to estimate performance for overtrained models with novel architectures, which is why it’s not obvious that it won’t normally be used like this. So there are two different ways of formulating effective compute for overtrained models, both useful for different purposes. I was under the impression that simply multiplying physical compute by a compute multiplier, measured by comparing the performance of Chinchilla optimal models of different architectures, is how effective compute is usually formulated even for overtrained models, and that the other formulation you’ve pointed out is usually discussed in terms of perplexity or explicitly in terms of Chinchilla optimal models with equivalent performance, not in the language of effective compute.
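To illustrate how the two formulations come apart on the example from the top-level comment, here is a hedged sketch (the function names and the multiplier-based accounting are my own illustration, not anything Anthropic has documented): under the compute-multiplier reading, an overtrained model with an unchanged architecture gets a multiplier of 1, so its effective compute equals its physical compute; under the performance-matching reading, its effective compute is the compute of the Chinchilla optimal run that reaches the same loss.

```python
# Illustrative only: two possible accountings of "effective compute" for the
# overtrained model from the example (N/3 params, 180N tokens), relative to
# the Chinchilla optimal baseline (N params, 20N tokens). C ≈ 6 * N * D.
N = 1e9  # arbitrary parameter count; only ratios matter

baseline_flops    = 6 * N * (20 * N)         # 120 N^2, Chinchilla optimal
overtrained_flops = 6 * (N / 3) * (180 * N)  # 360 N^2, assumed same performance

def effective_compute_multiplier(physical_flops, multiplier=1.0):
    # Formulation 1: multiply physical compute by an architecture-level
    # multiplier (e.g. ~5x for Transformer++ vs vanilla Transformer).
    # Same architecture here, so the multiplier is 1 and nothing changes.
    return multiplier * physical_flops

def effective_compute_matched(chinchilla_equivalent_flops):
    # Formulation 2: the compute of the Chinchilla optimal run that would
    # reach the same performance (predicted, e.g., from training CE loss).
    return chinchilla_equivalent_flops

# Formulation 1: a 3x increase, same as physical compute.
print(effective_compute_multiplier(overtrained_flops) / baseline_flops)  # 3.0
# Formulation 2: no increase at all, since performance is unchanged.
print(effective_compute_matched(baseline_flops) / baseline_flops)        # 1.0
```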