My estimate is about 400 billion parameters (100 billion to 1 trillion), based on EpochAI’s estimate of GPT-4’s training compute and on scaling laws, which can be used to calculate the optimal number of parameters and training tokens for a language model given a certain compute budget.
Although 1 trillion sounds impressive, and bigger models tend to achieve a lower loss given a fixed amount of data, an increased number of parameters is not necessarily more desirable: a bigger model uses more compute per token, so for a fixed compute budget it can’t be trained on as much data.
If the model is made too big, the reduction in training tokens actually outweighs the benefit of the larger model, leading to worse performance.
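To make the arithmetic concrete, here is a minimal sketch of the calculation, assuming the common Chinchilla rules of thumb (training compute ≈ 6 × parameters × tokens, and roughly 20 training tokens per parameter) and a training-compute figure of about 2e25 FLOP; the exact compute value is an assumption in the spirit of EpochAI’s estimate, not a number taken from this post.

```python
# A minimal sketch of the estimate, assuming the Chinchilla rules of thumb:
#   training compute  C ≈ 6 * N * D   (N = parameters, D = training tokens)
#   compute-optimal   D ≈ 20 * N      (~20 tokens per parameter)
# The compute figure below is an assumed round number in the ballpark of
# EpochAI's estimate, not an official figure.

C = 2e25  # assumed GPT-4 training compute, in FLOP

# Substituting D = 20 * N into C = 6 * N * D gives C = 120 * N**2,
# so the compute-optimal parameter count is N = sqrt(C / 120).
N_opt = (C / 120) ** 0.5
D_opt = 20 * N_opt

print(f"compute-optimal parameters: {N_opt:.1e}")  # ~4e11, i.e. ~400 billion
print(f"compute-optimal tokens:     {D_opt:.1e}")  # ~8e12, i.e. ~8 trillion

# Even if the compute estimate were 10x too low, the optimal model size would
# only grow by sqrt(10) ≈ 3.2x, to roughly 1.3 trillion parameters.
print(f"with 10x the compute:       {(10 * C / 120) ** 0.5:.1e}")
```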
Extract from the Training Compute-Optimal Large Language Models paper:
“our analysis clearly suggests that given the training compute budget for many current LLMs, smaller models should have been trained on more tokens to achieve the most performant model.”
Another quote from the paper:
“Unless one has a compute budget of 10^26 FLOPs (over 250× the compute used to train Gopher), a 1 trillion parameter model is unlikely to be the optimal model to train.”
So unless the EpochAI estimate is too low by about an order of magnitude [1] or OpenAI has discovered new and better scaling laws, the number of parameters in GPT-4 is probably lower than 1 trillion.
My Twitter thread estimating the number of parameters in GPT-4.
[1] I don’t think it is, but it could be.