Retrospective on ‘GPT-4 Predictions’ After the Release of GPT-4
In February 2023, I wrote a post titled “GPT-4 Predictions”, which was an attempt to predict the properties and capabilities of OpenAI’s GPT-4 model using scaling laws and knowledge of past models such as GPT-3. Now that GPT-4 has been released, I’d like to evaluate these past predictions.
Unfortunately, since the GPT-4 technical report has limited information on GPT-4’s training process and model properties, I can’t evaluate all the predictions. Nevertheless, I believe I can evaluate enough of them right now to yield useful insights.
GPT-4 release date
OpenAI released GPT-4 on 14 March 2023.
I mentioned in the post that Metaculus predicted a 50% chance of GPT-4 being released by May 2023. Consequently, I expected the model to be released sometime around the middle of the year, so it was released earlier than I expected.
Training process
Number of GPUs used during training
Some people such as LawrenceC and gwern have noted in the post’s comments that GPT-4 was probably trained on 15,000 GPUs or more. Assuming this is true, my prediction that GPT-4 would be trained on 2,000 to 15,000 GPUs seems like an underestimate. Consequently, I may have underestimated GPT-4’s total training compute by about a factor of 2.
I originally predicted that GPT-4 would use about 5.63e24 FLOP of compute. According to EpochAI, the true value is about 2.2e25 FLOP, which is about 4x my original estimate. The chart below also shows how GPT-4 came out earlier than I expected.
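As a quick check of the "about 4x" figure, the ratio can be computed directly from the two values quoted above. This is only a minimal sketch; the variable names are illustrative and not from the original post.

```python
# Values quoted in the post, in FLOP. Variable names are illustrative.
predicted_flop = 5.63e24  # my original prediction
estimated_flop = 2.2e25   # estimate attributed to EpochAI above

# Ratio of the later estimate to the original prediction.
ratio = estimated_flop / predicted_flop
print(f"estimate / prediction = {ratio:.2f}x")
```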
Training time
The OpenAI GPT-4 video states that GPT-4 finished training in August 2022. Given that GPT-3.5 finished training in early 2022, this suggests that GPT-4 was trained for about 4-7 months. I originally predicted that the training time would be 1-6 months, which seems like an underestimate in retrospect.
GPT-4 model properties
I predicted that GPT-4 would be a dense, text-only, transformer language model like GPT-3 trained using more compute and data with a similar number of parameters (175B) and a longer context window (8k tokens).
My most obviously incorrect prediction was that GPT-4 would be a text-only language model like GPT-3. Instead, GPT-4 is a multimodal model that accepts both text and images as inputs, though it only outputs text.
Apart from that, I think my predictions about the model were mostly correct: GPT-4 is a pre-trained transformer language model trained using next-word prediction, fine-tuning, and RLHF like its predecessors.
OpenAI hasn’t yet published information such as the number of parameters in the model but we can infer these properties using other information.
In the previous post, I fitted a linear model relating loss to MMLU performance. I used the same dataset to fit the inverse model, which maps MMLU performance to loss. Given that GPT-4’s performance was 86.4% on MMLU, we can use the inverse model to estimate that GPT-4’s cross-entropy loss per word is about 1.85 (which is lower than my predicted value of 1.87).
Given GPT-4’s estimated training compute and loss and a set of plausible values for its number of training tokens (e.g. 1e12 to 1e13), we can use the tables from the Chinchilla paper (“Training Compute-Optimal Large Language Models”) to estimate the number of parameters in the model. Using this method, I estimate that there are 300-500B parameters in the GPT-4 model.
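To make this kind of estimate more concrete, here is a minimal sketch using the parametric loss function fitted in the Chinchilla paper, L(N, D) = E + A/N^α + B/D^β, with the constants reported there. Two caveats: the Chinchilla loss is per token rather than per word, and the rule of thumb C ≈ 6ND used to tie compute to tokens is an assumption of this sketch, not necessarily the procedure I used with the paper's tables. The function names are illustrative.

```python
# Fitted constants reported in the Chinchilla paper for
# L(N, D) = E + A / N**ALPHA + B / D**BETA (loss in nats per token).
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Parametric loss estimate for n_params parameters and n_tokens tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def implied_tokens(compute_flop: float, n_params: float) -> float:
    """Training tokens implied by the rule of thumb C ~= 6 * N * D (assumption)."""
    return compute_flop / (6.0 * n_params)


compute_flop = 2.2e25  # training-compute estimate quoted earlier in the post

# Scan a few candidate parameter counts and report the implied tokens and loss.
for n_params in (300e9, 400e9, 500e9):
    n_tokens = implied_tokens(compute_flop, n_params)
    loss = chinchilla_loss(n_params, n_tokens)
    print(f"N={n_params:.0e}  D={n_tokens:.2e}  loss={loss:.3f}")
```

Scanning candidate parameter counts like this shows which combinations of parameters and tokens are roughly compatible with a given compute budget and loss estimate.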
GPT-4 performance
MMLU performance
Fortunately, both my post and the GPT-4 technical report referenced the MMLU benchmark. In the previous post, I predicted that GPT-4 could set a new record on the MMLU benchmark. Specifically, given my prediction of the model’s loss, I predicted that GPT-4 could achieve 79.4% accuracy, which is better than the previous record of 75.2% set by a fine-tuned version of PaLM.
GPT-4 in fact achieved 86.4% on the MMLU benchmark which is a new record and higher than I predicted. My prediction vs GPT-4’s actual accuracy on the MMLU benchmark is shown in the following graph.
The GPT-4 paper says that GPT-4’s loss when predicting OpenAI’s internal codebase is about 1.2 bits/word, which is about 0.83 nats/word (1.2 × ln 2) and much lower than my predicted value of 1.87. I’m not sure what the true value is so I’m going to assume that it’s 1.85 in the following chart based on the amount of compute used to train GPT-4:
The percent error between my prediction and the true value of GPT-4’s performance on the MMLU dataset is 8.1%, which seems like a fairly accurate prediction [1].
GPT-4 writing ability
Based on GPT-3’s improvement trend from the GPT-3 paper, I also predicted that human evaluators would only be able to distinguish model-generated text from human-written text about 50% of the time. In other words, I predicted that GPT-4’s text would be indistinguishable from human-written text.
From my personal experience, GPT-4-generated text seems indistinguishable from human-written text though there doesn’t seem to be any quantitative evaluation of this metric for GPT-4 yet.
Context length
Given that GPT-3 and GPT-3.5 had context lengths of 2048 tokens and 4096 tokens respectively, my guess was that GPT-4 would have a context length of 8192 tokens.
According to the OpenAI API, one of the GPT-4 models does indeed have a context length of 8192 tokens. However, there is another model with a context length of 32,768 tokens. Therefore, my prediction was partially correct but also underestimated the increase in context length.
Prediction framework
My predictions of GPT-4’s performance were based on the following assumptions:
Model loss can be accurately calculated using scaling laws that can estimate a model’s loss given inputs such as the number of parameters in the model, the amount of training compute, and training data.
There is a power law relationship between increases in these inputs and decreases in loss.
Decreases in model loss are linearly correlated with improved performance as measured by benchmarks such as MMLU [2].
GPT-4 includes no algorithmic advances that would significantly increase the model’s compute efficiency, data efficiency, or performance.
The prediction framework is summarized in this diagram:
Although these simplifying assumptions could have limited the accuracy of the prediction model, I believe I was able to at least predict GPT-4’s loss and some of GPT-4’s capabilities fairly accurately given knowledge of scaling laws derived from the behavior of smaller models and knowledge of the current capabilities, model properties, and training process of GPT-3 and similar models.
Similarly, the GPT-4 technical report includes details on how OpenAI used smaller models to predict GPT-4’s performance:
“A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4’s performance.”
Given that OpenAI has full access to all information about GPT-3 and GPT-4, their predictions were probably more accurate than mine.
Limitations of the framework
I think the biggest limitation of the framework is its neglect of algorithmic advances such as the introduction of image inputs to the GPT-4 model. Not taking algorithmic advances into account could also explain why I underestimated GPT-4’s performance improvement on the MMLU benchmark.
Although the average capabilities of language models tend to scale smoothly given more resources, specific capabilities can emerge abruptly. Therefore, a model that predicts linear improvements on certain capabilities in the short term could merely be a short tangent to a more complex non-linear curve. This suggests that predicting specific capabilities in the long term is significantly more difficult.
Summary of predictions
| Name | Prediction | Reality | Difference |
| --- | --- | --- | --- |
| GPT-4 release date | 05/2023 | 03/2023 | NA |
| GPT-4 training compute (FLOP) | 5.63e24 | 2.2e25 | 390% |
| GPT-4 model parameters | 175B | 300-1000B | 70-570% |
| GPT-4 MMLU performance (%) | 79.4 | 86.4 | 8.1% |
Conclusions
GPT-4 was released earlier than I expected and consequently, I published the “GPT-4 Predictions” post just a month before the release of GPT-4 which possibly limited its utility. Given that the post was mostly based on data from 2020 and 2021 on models such as GPT-3, I think I could have made the predictions much earlier without a significant loss of accuracy. For example, if I had written the post in early 2021 it would have been published 2 years before the release of GPT-4.
I focused on benchmarks such as MMLU but I can now see from the GPT-4 technical report that human tests such as the SAT are also useful for evaluating language models.
I didn’t make any predictions on the safety improvements of GPT-4 over GPT-3 and such predictions could have been insightful.
My predictions seem to be evidence that it’s possible to use scaling laws and other predictable quantitative methods to predict the general performance of language models at least in the short term.
However, as the summary table of my predictions shows, many of my predictions were inaccurate even though I was merely predicting the properties of a model a single generation later over a period of fewer than 3 years. Given the increased effect of algorithmic advances on ML capabilities in the long term and the inherent unpredictability of scientific progress, I expect accurately predicting the capabilities of ML models in the long term (>5 years) to be much more challenging and maybe even impossible [3].
Even though I used fairly rigorous quantitative methods, my predictions were still inaccurate to some degree. I expect predictions based on narratives or intuitions to be even less accurate. Overall, this prediction exercise suggests that in addition to the future being difficult to predict, we should probably believe that most predictions about the future are wrong to a certain extent.
To summarize, I believe a prediction is more likely to be correct if:
It’s based on simple quantitative empirically-supported methods such as scaling laws.
It’s short-term.
It focuses on predicting some narrowly defined aspect of the future and avoids being too ambitious.
Conversely, I expect most long-term, sweeping general predictions that are based on intuitions or specific narratives to be very inaccurate or wrong.
[1] As far as I know, the GPT-4 Technical Report also evaluates GPT-3.5 on the MMLU benchmark for the first time (source).
[2] This Anthropic paper notes that GPT-3’s MMLU performance improves very slowly when the model is below 10B parameters and then more quickly above that threshold, which is a non-linear relationship.
[3] There is evidence showing that algorithmic progress increases predictably over time.
Given that we are at the top end of the logistic success curve (getting closer and closer to 100% rather than farther and farther from 0%), I think a more correct/fair/accurate way to assess this would be to look at the failure rate you predicted vs. the failure rate that actually happened. So, you predicted GPT-4 would get approximately 20% of MMLU wrong, whereas actually it got 13.6% wrong. So basically you predicted it would make 50% more errors than it did.
I still think you deserve some credit for making this prediction, but I wouldn’t call it ‘fairly accurate’ and I definitely don’t think “8.7% off!” is the right way to think about the diff.
At 86.4%, GPT-4’s accuracy is now approaching 100% but GPT-3’s accuracy, which was my prior, was only 43.9%. Obviously one would expect GPT-4’s accuracy to be higher than GPT-3’s since it wouldn’t make sense for OpenAI to release a worse model, but it wasn’t clear ex ante that GPT-4’s accuracy would be near 100%.
I predicted that GPT-4’s accuracy would fall short of 100% accuracy by 20.6% when the true value was 13.6%. Using this approach, the error would be $\frac{20.6 - 13.6}{13.6} \approx 0.51$.
Strictly speaking, the formula for percent error according to Wikipedia is the relative error expressed as a percentage:
$$\text{percent error} = \frac{v_{\text{true}} - v_{\text{approx}}}{v_{\text{true}}} \times 100$$
I think this is the correct formula to use because what I’m trying to measure is the deviation of the true value from the regression line (predicted value).
Using the formula, the percent error is $\frac{86.4 - 79.4}{86.4} \times 100 \approx 8.1$.
I updated the post to use the term ‘percent error’ with a link to the Wikipedia page and a value of 8.1%.
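As a rough sketch, the two calculations discussed in this thread can be reproduced side by side: the percent error on accuracy, and the relative gap between the predicted and actual error rates. The function and variable names below are illustrative only.

```python
def percent_error(v_true: float, v_approx: float) -> float:
    """Relative error of v_approx with respect to v_true, as a percentage."""
    return (v_true - v_approx) / v_true * 100


predicted_acc = 79.4  # predicted MMLU accuracy (%)
actual_acc = 86.4     # reported MMLU accuracy (%)

# Percent error measured on accuracy (the figure used in the post).
accuracy_error = percent_error(actual_acc, predicted_acc)

# The same comparison made on error rates (100% minus accuracy).
predicted_err_rate = 100 - predicted_acc
actual_err_rate = 100 - actual_acc
error_rate_excess = (predicted_err_rate - actual_err_rate) / actual_err_rate

print(f"percent error on accuracy: {accuracy_error:.1f}%")
print(f"relative excess of predicted errors: {error_rate_excess:.2f}")
```

Both numbers come from the same two accuracies, so the disagreement is about which quantity is the right one to compare, not about the arithmetic.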
Suppose you predicted 91% but the actual value was 99%. The percent error may only be about 8% but the likelihood of a wrong answer is 1⁄100 instead of your predicted 9⁄100, which is a huge difference.
You may be interested in the links in this post: https://www.lesswrong.com/posts/6Ltniokkr3qt7bzWw/log-odds-or-logits
In this case, the percent error is 8.1% and the absolute error is 8%. If one student gets 91% on a test and another gets 99% they both get an A so the difference doesn’t seem large to me.
The article linked seems to be missing. Can you explain your point in more detail?
OK. Let’s make it even more extreme. Suppose you take a commercial flight. The likelihood of dying in a crash is on the order of 1 in 10 million. From a percent error or absolute error perspective, 99.99999% isn’t that different from 99% but that is the difference between one plane crash per year globally and a couple of dozen plane crashes per hour on average. These are wildly different in terms of acceptable safety.
There’s a backup link in the comments: https://www.thejach.com/public/log-probability.pdf
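To illustrate the log-odds view, here is a minimal sketch that compares accuracies after mapping them through the logit function. It uses the hypothetical 91% vs 99% example from this thread alongside the predicted and actual MMLU accuracies; the helper name `logit` is my own choice and the code is only an illustration of the idea.

```python
import math


def logit(p: float) -> float:
    """Log-odds of probability p, in nats."""
    return math.log(p / (1.0 - p))


# (predicted, actual) pairs: the hypothetical example and the MMLU figures.
pairs = ((0.91, 0.99), (0.794, 0.864))

for predicted, actual in pairs:
    gap = logit(actual) - logit(predicted)
    print(f"predicted={predicted:.3f} actual={actual:.3f} log-odds gap={gap:.2f}")
```

On this scale, a gap of about 8 percentage points near 99% is much larger than a gap of similar size near 80-86%, which is the point of comparing error rates rather than accuracies.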
Data seems to be a bottleneck, so we should expect the number of model parameters to run high to compensate.
Note that an MMLU score of 100% should be achievable using a model the same size as Megatron-Turing NLG and only 2.1x more data than GPT-4, which should be achievable in the near term.