Original GPT-4 is rumored to have been a 2e25 FLOPs model (trained on A100s). Then there was GPT-4T, which might’ve been smaller, and now GPT-4o. In early 2024, 1e26 FLOPs doesn’t seem out of the question, so GPT-4o was potentially trained with 5x the compute of the original GPT-4.
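As a quick sanity check on that ratio (using the rumored and speculative FLOP figures above, not confirmed numbers):

```python
# Rumored / speculative training compute figures from above.
gpt4_flops = 2e25   # original GPT-4 (rumored)
gpt4o_flops = 1e26  # plausible early-2024 frontier budget (speculative)

print(f"compute ratio: {gpt4o_flops / gpt4_flops:.0f}x")  # -> 5x
```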
There is a technical sense of knowledge distillation[1] where in training you target logits of a smarter model rather than raw tokens. It’s been used for training Gemma 2 and Llama 3.2. It’s unclear if knowledge distillation is useful for training similarly-capable models, let alone more capable ones, and GPT-4o seems in most ways more capable than original GPT-4.
See this recent paper for example.
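To illustrate what targeting logits rather than raw tokens looks like, here is a minimal sketch of a soft-target distillation loss (a generic PyTorch-style example, not the actual recipe used for Gemma 2, Llama 3.2, or any OpenAI model):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # The student is trained to match the teacher's (temperature-softened)
    # output distribution instead of the one-hot next-token targets.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # KL(teacher || student); the T^2 factor keeps gradient scale comparable
    # across temperatures (Hinton et al., 2015).
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature**2

# Dummy usage: a batch of 4 positions over a 32k-token vocabulary.
student_logits = torch.randn(4, 32000, requires_grad=True)
teacher_logits = torch.randn(4, 32000)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```

In practice a distillation term like this is often combined with the ordinary next-token cross-entropy loss.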
Maybe GPT-4o does use more compute than GPT-4, though given that it’s a cheap model for the end user, I wouldn’t really expect that to happen.
Cost of inference is determined by the shape of the model (things like the number of active parameters), which screens off the compute used in training: the training compute could be anything, and the cost of inference doesn’t depend on it as long as the model shape doesn’t change.
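To make “screens off” concrete: a common rule of thumb puts the forward-pass cost at roughly 2 FLOPs per active parameter per token, which depends only on the model’s shape, not on the training budget (a rough sketch that ignores attention and other overheads):

```python
def inference_flops_per_token(active_params: float) -> float:
    # Rule of thumb: ~2 FLOPs (one multiply, one add) per active parameter
    # per token; ignores attention over the context, embeddings, etc.
    return 2 * active_params

# Two models with the same shape cost about the same to serve,
# no matter how much compute went into training them.
print(f"{inference_flops_per_token(270e9):.2e} FLOPs/token")  # ~5.40e+11
```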
So compare specific prices with those of models of known size[1]. GPT-4o costs $2.5 per million input tokens, while Llama-3-405B costs $3.5 per million input tokens. If price scales roughly with active parameters, that puts GPT-4o at 200-300B active parameters. Original GPT-4 is rumored to have about 270B active parameters (of 1.8T total parameters). It’s OpenAI, not an API provider serving an open weights model, so in principle the price could be misleading (below cost), but what data we have points to GPT-4o being about the same size as original GPT-4, maybe 2x smaller if there’s still margin in the price.
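The back-of-the-envelope behind the 200-300B figure, assuming price scales roughly linearly with active parameters and that margins are comparable (both strong assumptions):

```python
llama3_active_params = 405e9  # Llama-3-405B is dense, so all parameters are active
llama3_price = 3.5            # $ per million input tokens
gpt4o_price = 2.5             # $ per million input tokens

# If price per token is roughly proportional to active parameters:
implied_gpt4o_params = llama3_active_params * gpt4o_price / llama3_price
print(f"implied GPT-4o active parameters: {implied_gpt4o_params / 1e9:.0f}B")  # ~289B
```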
Edit: There’s a mistake in the estimate; I got confused between training and inference. Correcting it points to even larger models, though the comparison with Llama-3-405B suggests there is another factor counterbalancing the correction, probably practical issues with getting sufficient batch sizes, so the original conclusion should still be about right.
I just did this exercise for Claude 3.5 Haiku, more details there.