Claude 3.5 Sonnet solves 64% of problems on an internal agentic coding evaluation, compared to 38% for Claude 3 Opus. Our evaluation tests a model’s ability to understand an open source codebase and implement a pull request, such as a bug fix or new feature, given a natural language description of the desired improvement.
...
While Claude 3.5 Sonnet represents an improvement in capabilities over our previously released Opus model, it does not trigger the 4x effective compute threshold at which we will run the full evaluation protocol described in our Responsible Scaling Policy (RSP).
Hmmm, maybe the 4x effective compute threshold is too large, given that you’re getting a near-doubling of agentic task performance (on what I think is an eval with particularly good validity) without hitting the threshold.
Or maybe at the very least you should make some falsifiable predictions that might cause you to change this threshold. e.g., “If we train a model that has downstream performance (on any of some DC evals) ≥10% higher than was predicted by our primary prediction metric, we will revisit our prediction model and evaluation threshold.”
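To make that kind of commitment concrete, here is a minimal sketch of the proposed trigger rule. Everything in it is hypothetical: the 10% margin is just the number from the example above (read here as percentage points), and the eval names, function names, and the 45% predicted score are purely illustrative, not anything Anthropic has published.

```python
# Hypothetical sketch of the falsifiable-prediction rule proposed above: if a
# model's measured downstream performance beats the pre-registered prediction
# by >= 10 percentage points on any tracked eval, flag the prediction model and
# the effective compute threshold for review. All names and numbers are
# illustrative placeholders.

REVIEW_MARGIN = 0.10  # 10 percentage points, per the example in the comment

def needs_threshold_review(predicted: dict[str, float], measured: dict[str, float]) -> bool:
    """True if any eval exceeds its pre-registered prediction by the review margin."""
    return any(measured[name] - predicted[name] >= REVIEW_MARGIN for name in predicted)

# e.g. an agentic coding eval pre-registered at 45% (made-up number) but
# measured at 64% would trigger a review of the prediction model and threshold:
print(needs_threshold_review({"agentic_coding": 0.45}, {"agentic_coding": 0.64}))  # True
```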
It is unknown to me whether Sonnet 3.5’s performance on this agentic coding evaluation was predicted in advance at Anthropic. It seems wild to me that you can double your performance on a high validity ARA-relevant evaluation without triggering the “must evaluate” threshold; I think evaluation should probably be required in that case, and therefore, if I had written the 4x threshold, I would be reducing it. But maybe those who wrote the threshold were totally game for these sorts of capability jumps?
I think more to the point is that measuring effective compute becomes misleading once you deviate from Chinchilla optimality: by taking a detour through overtrained models, you can span a larger increase in effective compute between Chinchilla optimal models without any single step crossing the threshold. And given the price difference, Claude 3.5 Sonnet is likely more overtrained than Claude 3 Opus.
Let’s say we start with a Chinchilla optimal model with N active parameters that trains on 20N tokens, using about 120N² FLOPs of compute (with the C ≈ 6ND approximation). We can then train another model with N/3 active parameters on 180N tokens, using 360N² FLOPs, and get approximately the same performance as the first model, but we’ve now used 3 times more compute, below the RSP’s 4x threshold. Then we train the next Chinchilla optimal model with 3N active parameters on 60N tokens, using 1080N² FLOPs, another 3x increase, also below the 4x threshold. But only this second step to the new Chinchilla optimal model increases capabilities, and it uses 9x more compute than the previous Chinchilla optimal model.
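A minimal sketch of the arithmetic in this detour example, using the standard C ≈ 6ND FLOP approximation. The baseline parameter count is arbitrary (only the ratios matter), and the “same performance” claim for the overtrained middle model is the comment’s assumption, not something the code checks.

```python
# FLOP arithmetic for the detour example above, using the common approximation
# C ≈ 6 * N * D (training FLOPs ≈ 6 x parameters x training tokens).

def train_flops(params: float, tokens: float) -> float:
    """Approximate training compute in FLOPs for a dense transformer."""
    return 6 * params * tokens

N = 1e9  # baseline parameter count, illustrative only

chinchilla_base = train_flops(N, 20 * N)        # 120 N^2
overtrained_mid = train_flops(N / 3, 180 * N)   # 360 N^2
chinchilla_next = train_flops(3 * N, 60 * N)    # 1080 N^2

print(overtrained_mid / chinchilla_base)   # 3.0 -> below a 4x threshold
print(chinchilla_next / overtrained_mid)   # 3.0 -> below a 4x threshold
print(chinchilla_next / chinchilla_base)   # 9.0 -> the jump between the two
                                           #        Chinchilla optimal models
```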
It looks like the example you gave is pretty explicitly using “compute” rather than “effective compute”. The point of the “effective” part is to take non-compute progress into account, such as using more optimal N/D ratios. I think in your example, the first two models would be at the same effective compute level, based on us predicting the same performance.
That said, I haven’t seen any detailed descriptions of how Anthropic is actually measuring/calculating effective compute (iirc they link to a couple papers and the main theme is that you can use training CE loss as a predictor).
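As a hypothetical sketch of what that loss-based definition could look like in practice: fit a compute-optimal scaling curve of the form L(C) = E + A·C^(-α) for training CE loss, then invert it so any model’s achieved loss maps back to the compute a compute-optimal run would have needed to reach it. The constants and function name below are illustrative placeholders, not Anthropic’s numbers or methodology.

```python
# Hypothetical loss-based "effective compute" metric: fit a compute-optimal
# scaling curve L(C) = E + A * C**(-ALPHA) (training CE loss vs. FLOPs), then
# invert it so an achieved loss maps to the compute a compute-optimal run
# would have needed. Constants are illustrative placeholders only.

E, A, ALPHA = 1.7, 1200.0, 0.155  # irreducible loss, scale, compute exponent

def effective_compute(train_loss: float) -> float:
    """FLOPs at which the fitted compute-optimal curve reaches train_loss."""
    assert train_loss > E, "loss must exceed the fitted irreducible term"
    return (A / (train_loss - E)) ** (1 / ALPHA)

# Under this definition, an overtrained model and a Chinchilla optimal model
# that reach the same training loss are assigned the same effective compute,
# so the 3x "detour" step in the earlier example would count as ~1x.
print(f"{effective_compute(1.95):.2e}")  # compute-optimal FLOPs equivalent to loss 1.95
```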
I think in your example, the first two models would be at the same effective compute level, based on us predicting the same performance.
This is a reasonable formulation of what “effective compute” could be defined to mean, but is it actually used in this sense in practice, and who uses it like that? Is it plausible that this sense was in use when Anthropic made the claim that “While Claude 3.5 Sonnet represents an improvement in capabilities over our previously released Opus model, it does not trigger the 4x effective compute threshold”, a claim that compares a more Chinchilla optimal model to a more overtrained one?
It’s an interesting thought; I didn’t consider that this sense of “effective compute” could be the intended meaning. I was thinking more of a compute multiplier measured from perplexity/FLOPs plots of compute optimal training runs that compare architectures, like Figure 4 of the Mamba paper: there, Transformer++ (RMSNorm/SwiGLU/etc.) needs about 5 times less compute (2 times less data) than vanilla Transformer to reach the same perplexity, so you just multiply physical compute by 5 to get the effective compute of Transformer++ with respect to vanilla Transformer. (With this sense of “effective compute”, my argument in the grandparent comment remains the same for effective compute as it is for physical compute.)
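A minimal sketch of this multiplier-based formulation, under the assumptions above: the 5x value is this comment’s reading of the Mamba paper’s Figure 4, and the function and variable names are purely illustrative.

```python
# Multiplier-based "effective compute": measure an architecture's compute
# multiplier on compute-optimal runs, then apply it to physical training FLOPs.
# The 5x value is the comment's reading of Mamba's Figure 4; names here are
# illustrative, not an established convention.

ARCH_MULTIPLIER = {
    "vanilla_transformer": 1.0,
    "transformer_pp": 5.0,   # RMSNorm/SwiGLU/etc., per the comment above
}

def effective_compute(physical_flops: float, arch: str) -> float:
    """Physical training FLOPs scaled by the architecture's compute multiplier."""
    return physical_flops * ARCH_MULTIPLIER[arch]

# Under this formulation, overtraining does not change effective compute at all:
# a run's effective compute depends only on its physical FLOPs and architecture,
# which is why the 3x / 3x detour in the earlier example goes through unchanged.
print(effective_compute(1e24, "transformer_pp"))  # 5e24 vanilla-equivalent FLOPs
```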
In particular, this multiplication still makes sense as a way to estimate performance of overtrained models with novel architectures, which is why it’s not obvious that it won’t normally be used that way. So there are two different ways of formulating effective compute for overtrained models, and both are useful for different purposes. I was under the impression that simply multiplying by a compute multiplier (measured by comparing the performance of Chinchilla optimal models of different architectures) is how effective compute is usually formulated even for overtrained models, and that the other formulation you’ve pointed out is usually discussed in terms of perplexity, or more explicitly in terms of Chinchilla optimal models with equivalent performance, rather than in the language of effective compute.