Based on my estimate for GPT-4’s loss, I predicted that its performance would be 179.7% better than GPT-3. In reality, GPT-4’s performance was 196.8% better, which means my prediction had a percentage error of 8.7%. In other words, I underpredicted the true value by 8.7%, which seems like a fairly accurate prediction [1].
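For concreteness, here is how the 8.7% figure works out, as a quick Python sketch of the arithmetic (the variable names are mine):

```python
# Figures from the post: predicted vs. observed improvement of GPT-4 over GPT-3.
predicted_improvement = 179.7  # predicted performance gain over GPT-3, in %
actual_improvement = 196.8     # observed performance gain over GPT-3, in %

# Percent error of the prediction relative to the observed value.
percent_error = (actual_improvement - predicted_improvement) / actual_improvement * 100
print(f"{percent_error:.1f}%")  # -> 8.7%
```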
Given that we are at the top end of the logistic success curve (getting closer and closer to 100% rather than farther and farther from 0%), I think a more correct/fair/accurate way to assess this would be to look at the failure rate you predicted vs. the failure rate that actually happened. So, you predicted GPT-4 would get approximately 20% of MMLU questions wrong, whereas it actually got 13.6% wrong. So basically you predicted it would make about 50% more errors than it did.
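A quick sketch of that comparison (the 20.6% figure is the predicted error rate used later in the thread):

```python
# Failure-rate framing: compare the predicted vs. actual share of MMLU answered wrong.
predicted_error_rate = 20.6  # % predicted wrong (100 - 79.4)
actual_error_rate = 13.6     # % actually wrong (100 - 86.4)

ratio = predicted_error_rate / actual_error_rate
print(f"predicted {ratio:.2f}x the errors that actually occurred")  # -> ~1.51x, i.e. ~50% more
```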
I still think you deserve some credit for making this prediction, but I wouldn’t call it ‘fairly accurate’ and I definitely don’t think “8.7% off!” is the right way to think about the diff.
At 86.4%, GPT-4’s accuracy is now approaching 100%, but GPT-3’s accuracy, which was my prior, was only 43.9%. Obviously one would expect GPT-4’s accuracy to be higher than GPT-3’s since it wouldn’t make sense for OpenAI to release a worse model, but it wasn’t clear ex ante that GPT-4’s accuracy would be near 100%.
I predicted that GPT-4’s accuracy would fall short of 100% by 20.6% when the true shortfall was 13.6%. Using this approach, the error would be $\frac{20.6 - 13.6}{13.6} = 0.51$.
Strictly speaking, the formula for percent error according to Wikipedia is the relative error expressed as a percentage:
$\text{percent error} = \frac{v_\text{true} - v_\text{approx}}{v_\text{true}} \times 100$
I think this is the correct formula to use because what I’m trying to measure is the deviation of the true value from the regression line (predicted value).
Using the formula, the percent error is $\frac{86.4 - 79.4}{86.4} \times 100 = 8.1$.
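As a sketch, the same calculation as a small Python function (the function name is mine, not a standard library call):

```python
def percent_error(v_true: float, v_approx: float) -> float:
    """Relative error expressed as a percentage, per the formula above."""
    return (v_true - v_approx) / v_true * 100

# Observed vs. predicted MMLU accuracy for GPT-4.
print(round(percent_error(v_true=86.4, v_approx=79.4), 1))  # -> 8.1
```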
I updated the post to use the term ‘percent error’ with a link to the Wikipedia page and a value of 8.1%.
Suppose you predicted 91% but the actual value was 99%. The percent error may only be about 8% but the likelihood of a wrong answer is 1⁄100 instead of your predicted 9⁄100, which is a huge difference.
You may be interested in the links in this post: https://www.lesswrong.com/posts/6Ltniokkr3qt7bzWw/log-odds-or-logits
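The linked post is about measuring differences in log-odds (logits) rather than raw probabilities. As a rough illustration of that framing (my own sketch, not taken from the post), the gap between 91% and 99% is much larger in log-odds than the 8-point absolute difference suggests:

```python
import math

def log_odds(p: float) -> float:
    """Natural-log odds (logit) of a probability p."""
    return math.log(p / (1 - p))

# The hypothetical 91% vs. 99% from the comment above,
# plus the predicted vs. actual MMLU accuracies.
for p in (0.91, 0.99, 0.794, 0.864):
    print(f"{p:.1%} -> log-odds {log_odds(p):.2f}")
# 91% -> 2.31 and 99% -> 4.60: a gap of ~2.3 nats,
# versus only ~0.5 nats between 79.4% and 86.4%.
```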
In this case, the percent error is 8.1% and the absolute error is 8 percentage points. If one student gets 91% on a test and another gets 99%, they both get an A, so the difference doesn’t seem large to me.
The article linked seems to be missing. Can you explain your point in more detail?
OK. Let’s make it even more extreme. Suppose you take a commercial flight. The likelihood of dying in a crash is on the order of 1 in 10 million. From a percent error or absolute error perspective, 99.99999% isn’t that different from 99%, but that is the difference between one plane crash per year globally and a couple of dozen plane crashes per hour on average. These are wildly different in terms of acceptable safety.
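To put rough numbers on that scale difference (a sketch; the flights-per-year figure below is my own ballpark assumption, so the exact counts shift with it, but the hundred-thousand-fold gap does not):

```python
# Rough scale of the difference between a 1-in-10-million and a 1-in-100 crash risk per flight.
flights_per_year = 40_000_000   # ballpark global commercial flights per year (an assumption)
hours_per_year = 365 * 24

for crash_prob in (1e-7, 1e-2):  # ~99.99999% vs. ~99% safe per flight
    per_year = crash_prob * flights_per_year
    per_hour = per_year / hours_per_year
    print(f"p={crash_prob:g}: ~{per_year:,.0f} crashes/year, ~{per_hour:.3g}/hour")
# p=1e-07: ~4 crashes/year
# p=0.01:  ~400,000 crashes/year (~46/hour) -- a 100,000x difference in crash frequency
```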
There’s a backup link in the comments: https://www.thejach.com/public/log-probability.pdf