Aaron_Scher comments on Claude 3.5 Sonnet

Aaron_Scher 23 Jun 2024 23:31 UTC
3 points
0
It looks like the example you gave is pretty explicitly using “compute” rather than “effective compute”. The point of having the “effective” part is to take into account non-compute progress, such as using more optimal N/D ratios. I think in your example, the first two models would be at the same effective compute level, based on us predicting the same performance.

That said, I haven’t seen any detailed descriptions of how Anthropic is actually measuring/calculating effective compute (iirc they link to a couple papers and the main theme is that you can use training CE loss as a predictor).
- Vladimir_Nesov 23 Jun 2024 23:58 UTC
  2 points
  0
  
  > I think in your example, the first two models would be at the same effective compute level, based on us predicting the same performance.
  
  This is a reasonable formulation of what “effective compute” could be defined to mean, but is it actually used in this sense in practice, and who uses it like that? Is it plausible that this sense was in use when Anthropic claimed that “While Claude 3.5 Sonnet represents an improvement in capabilities over our previously released Opus model, it does not trigger the 4x effective compute threshold”, a claim that compares a more Chinchilla optimal model to a more overtrained model?
  
  It’s an interesting thought, I didn’t consider that this sense of “effective compute” could be the intended meaning. I was more thinking about having a compute multiplier measured from perplexity/FLOPs plots of optimal training runs that compare architectures, like in Figure 4 of the Mamba paper, where we can see that Transformer++ (RMSNorm/SwiGLU/etc.) needs about 5 times less compute (2 times less data) than vanilla Transformer to get the same perplexity, so you just multiply physical compute by 5 to find effective compute of Transformer++ with respect to vanilla Transformer. (With this sense of “effective compute”, my argument in the grandparent comment remains the same for effective compute as it is for physical compute.)
  
  In particular, this multiplication still makes sense in order to estimate performance for overtrained models with novel architectures, which is why it’s not obvious that it won’t normally be used like this. So there are two different possible ways of formulating effective compute for overtrained models, which are both useful for different purposes. I was under the impression that simply multiplying by a compute multiplier measured by comparing performance of Chinchilla optimal models of different architectures is how effective compute is usually formulated even for overtrained models, and that the meaning of the other possible formulation that you’ve pointed out is usually discussed in terms of perplexity or more explicitly Chinchilla optimal models with equivalent performance, not in the language of effective compute.
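
The two formulations discussed in this thread can be contrasted with a minimal sketch. Everything in it is an assumption made for illustration: the loss model is the parametric fit L(N, D) = E + A/N^α + B/D^β with the constants reported in the Chinchilla paper, physical compute is approximated as 6·N·D, and the function names are hypothetical. Nothing here is claimed to be how Anthropic actually measures effective compute.

```python
import math

# Assumed loss model: Chinchilla parametric fit (Hoffmann et al. 2022).
# Illustrative only; real scaling fits differ by data, tokenizer, and architecture.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def loss(n_params, n_tokens):
    """Predicted training loss for N parameters trained on D tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def physical_compute(n_params, n_tokens):
    """Standard approximation C = 6 * N * D (FLOPs)."""
    return 6.0 * n_params * n_tokens


def optimal_allocation(compute):
    """Loss-minimizing (N, D) for a fixed budget under the assumed fit."""
    g = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    n_opt = g * (compute / 6.0) ** (BETA / (ALPHA + BETA))
    return n_opt, compute / (6.0 * n_opt)


def effective_compute_multiplier(compute, multiplier):
    """Sense 1: scale physical compute by an architecture-level multiplier,
    regardless of whether the run is compute optimal or overtrained."""
    return multiplier * compute


def effective_compute_equal_loss(n_params, n_tokens):
    """Sense 2: compute of the compute-optimal run predicted to reach the
    same loss as the given (possibly overtrained) run. Found by bisection
    in log-compute, since optimal loss decreases monotonically with compute."""
    target = loss(n_params, n_tokens)
    lo = math.log(1e10)  # assumed lower bracket, far below any real run
    hi = math.log(physical_compute(n_params, n_tokens))  # optimal run is never worse
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if loss(*optimal_allocation(math.exp(mid))) > target:
            lo = mid
        else:
            hi = mid
    return math.exp(hi)
```

Under sense 1, a run with the multiplier of about 5 mentioned above gets `effective_compute_multiplier(c, 5.0)` whatever its N/D ratio. Under sense 2, a run that is off the compute-optimal ratio, for example the hypothetical `effective_compute_equal_loss(7e9, 2e12)`, comes out below its own `physical_compute(7e9, 2e12)`, which is the sense in which two models with the same predicted performance sit at the same effective compute level.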