tenthkrige comments on Are language models good at making predictions?

tenthkrige 7 Nov 2023 13:56 UTC
3 points
2
Very interesting!

From eyeballing the graphs, it looks like the average Brier score is barely below 0.25. This indicates that GPT-4 is better than a dart-throwing monkey (i.e. predicting a uniformly random percentage, expected score of 1/3 ≈ 0.33), but only barely better than chance (always predicting 50%, score of exactly 0.25).
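For anyone who wants to check the two baseline numbers, here is a minimal sketch. It assumes binary questions scored with the usual Brier score, the mean of (forecast − outcome)². The outcomes are simulated coin flips, not the post's question set, and the names `brier_score`, `always_half` and `uniform_random` are just made up for illustration. The 1/3 figure follows from the integral of p² over [0, 1], which holds whichever way the question resolves.

```python
import random

def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(outcomes)

rng = random.Random(0)  # fixed seed so the simulation is repeatable
n = 100_000

# Simulated outcomes (assumption: NOT the real questions from the post).
outcomes = [rng.randint(0, 1) for _ in range(n)]

# "Chance": always predict 50%. Every squared error is 0.25, whatever the outcome.
always_half = [0.5] * n

# "Dart-throwing monkey": a uniformly random probability for each question.
uniform_random = [rng.random() for _ in range(n)]

score_half = brier_score(always_half, outcomes)
score_monkey = brier_score(uniform_random, outcomes)

print(f"always 50%:     {score_half:.4f}")
print(f"uniform random: {score_monkey:.4f}  (theory: 1/3)")
```

Note that neither baseline depends on the base rate of the questions, so the same reference values apply to any set of binary questions.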

It would be interesting to see the decompositions for those two naive strategies for that set of questions, and compare to the sub-scores GPT-4 got.
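As a rough sketch of what that comparison could look like: the code below assumes the standard three-part Murphy decomposition (Brier = reliability − resolution + uncertainty), which may not be the exact decomposition used in the post. The outcomes are again simulated, with an arbitrary 40% base rate, and the monkey's forecasts are drawn from a grid of 11 values so that grouping by forecast value makes the identity hold exactly. All function and variable names are hypothetical.

```python
import random
from collections import defaultdict

def brier_score(forecasts, outcomes):
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(outcomes)

def murphy_decomposition(forecasts, outcomes):
    """Return (reliability, resolution, uncertainty), grouping by exact forecast value."""
    n = len(outcomes)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)
    for p, y in zip(forecasts, outcomes):
        groups[p].append(y)
    # Reliability: how far each forecast value is from the observed frequency in its group.
    reliability = sum(len(ys) * (p - sum(ys) / len(ys)) ** 2 for p, ys in groups.items()) / n
    # Resolution: how far the group frequencies are from the overall base rate.
    resolution = sum(len(ys) * (sum(ys) / len(ys) - base_rate) ** 2 for ys in groups.values()) / n
    # Uncertainty: depends only on the questions, not on the forecaster.
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty

rng = random.Random(0)
n = 50_000
# Simulated outcomes with an arbitrary 40% base rate (assumption, not the post's data).
outcomes = [1 if rng.random() < 0.4 else 0 for _ in range(n)]

always_half = [0.5] * n
# Discretised monkey: random forecasts on the grid 0.0, 0.1, ..., 1.0.
grid_monkey = [rng.randint(0, 10) / 10 for _ in range(n)]

half_parts = murphy_decomposition(always_half, outcomes)
monkey_parts = murphy_decomposition(grid_monkey, outcomes)

for name, parts in [("always 50%", half_parts), ("grid monkey", monkey_parts)]:
    rel, res, unc = parts
    print(f"{name}: reliability={rel:.4f} resolution={res:.4f} uncertainty={unc:.4f}")
```

Both baselines should have resolution at or near zero, since neither forecast carries any information about the outcome. The always-50% forecaster is penalised only for the gap between 0.5 and the base rate, while the monkey picks up a large reliability penalty. Comparing GPT-4's sub-scores against these would show whether its small edge comes from calibration or from actual resolution.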

You could also check if GPT-4 is significantly better than chance.
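One possible way to run that check, sketched under some assumptions: treat the questions as independent, take the per-question difference between the model's squared error and the 0.25 that the always-50% strategy gets, and bootstrap a confidence interval for the mean difference. If the whole interval sits below zero, the model beats chance on this question set. The function name `bootstrap_ci_vs_chance` and the toy forecasts are hypothetical, and the toy data is synthetic, not GPT-4's actual predictions.

```python
import random

def bootstrap_ci_vs_chance(forecasts, outcomes, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for (mean Brier score - 0.25).

    Assumes questions are independent; correlated questions would make
    the interval too narrow.
    """
    rng = random.Random(seed)
    # Per-question gap to the always-50% baseline (negative = better than chance).
    diffs = [(p - y) ** 2 - 0.25 for p, y in zip(forecasts, outcomes)]
    n = len(diffs)
    # Resample questions with replacement and record the mean gap each time.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Synthetic toy data, for illustration only.
toy_rng = random.Random(1)
toy_outcomes = [toy_rng.randint(0, 1) for _ in range(500)]
# A hypothetical forecaster that leans the right way on every question.
toy_informed = [0.8 if y == 1 else 0.2 for y in toy_outcomes]
# The chance baseline itself, as a sanity check.
toy_chance = [0.5] * len(toy_outcomes)

ci_informed = bootstrap_ci_vs_chance(toy_informed, toy_outcomes)
ci_chance = bootstrap_ci_vs_chance(toy_chance, toy_outcomes)

print("informed forecaster CI:", ci_informed)
print("always-50% CI:         ", ci_chance)
```

A paired t-test on the same per-question differences would be a reasonable alternative. The bootstrap is used here only because it needs nothing beyond the standard library and makes no normality assumption.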