Cost of inference is determined by the shape of the model, things like the number of active parameters, which screens off compute used in training (the compute could be anything, cost of inference doesn’t depend on it as long as the model shape doesn’t change).
So compare specific prices with those of models with known size[1]. GPT-4o costs $2.5 per million input tokens, while Llama-3-405B costs $3.5 per million input tokens. That is, it could be a 200-300B model (in active parameters). Original GPT-4 is rumored to be about 270B active parameters (at 1.8T total parameters). It’s OpenAI, not an API provider for an open weights model, so in principle the price could be misleading (below cost), but what data we have points to it being about the same size, maybe 2x smaller if there’s still margin in the price.
Edit: There’s a mistake in the estimate, I got confused between training and inference. Correcting the mistake points to even larger models, though comparing to Llama-3-405B suggests that there is another factor that counterbalances the correction, probably practical issues with getting sufficient batch sizes, so the original conclusion should still be about right.
Cost of inference is determined by the shape of the model, things like the number of active parameters, which screens off compute used in training (the compute could be anything, cost of inference doesn’t depend on it as long as the model shape doesn’t change).
So compare specific prices with those of models with known size[1]. GPT-4o costs $2.5 per million input tokens, while Llama-3-405B costs $3.5 per million input tokens. That is, it could be a 200-300B model (in active parameters). Original GPT-4 is rumored to be about 270B active parameters (at 1.8T total parameters). It’s OpenAI, not an API provider for an open weights model, so in principle the price could be misleading (below cost), but what data we have points to it being about the same size, maybe 2x smaller if there’s still margin in the price.
Edit: There’s a mistake in the estimate, I got confused between training and inference. Correcting the mistake points to even larger models, though comparing to Llama-3-405B suggests that there is another factor that counterbalances the correction, probably practical issues with getting sufficient batch sizes, so the original conclusion should still be about right.
I just did this exercise for Claude 3.5 Haiku, more details there.