Very interesting!
From eyeballing the graphs, it looks like the average Brier score is barely below 0.25. This indicates that GPT-4 is better than a dart-throwing monkey (i.e. predicting a uniformly random percentage, expected score of 1/3 ≈ 0.33), but only barely better than chance (always predicting 50%, score of exactly 0.25).
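For anyone who wants to check those two baseline numbers, here is a minimal sketch. The outcomes list is made up for illustration (it is not the question set from the post), and `brier` and `expected_brier_uniform` are just names I picked.

```python
def brier(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(outcomes)


def expected_brier_uniform(outcome, steps=1000):
    """Expected score of a forecast drawn uniformly from [0, 1].

    Approximates the integral of (p - outcome)^2 over p in [0, 1]
    with a midpoint rule; the exact value is 1/3 for outcome 0 or 1.
    """
    return sum((((i + 0.5) / steps) - outcome) ** 2 for i in range(steps)) / steps


# Hypothetical resolutions (1 = happened, 0 = did not), for illustration only.
outcomes = [1, 0, 0, 1, 0]

# Always predicting 50%: every squared error is 0.25, whatever the outcome.
always_half = brier([0.5] * len(outcomes), outcomes)

# Dart-throwing monkey: average the expected score over the questions.
monkey = sum(expected_brier_uniform(o) for o in outcomes) / len(outcomes)

print(always_half, monkey)
```

Neither baseline depends on how the questions actually resolved, which is why they work as fixed reference points.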
It would be interesting to see the decompositions for those two naive strategies on the same set of questions, and compare them to the sub-scores GPT-4 got.
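As a rough sketch of what that comparison could look like, assuming the decomposition in question is the usual three-component one (Brier score = reliability - resolution + uncertainty, grouping questions by distinct forecast value). The function name and the data below are hypothetical, not taken from the post.

```python
from collections import defaultdict


def murphy_decomposition(forecasts, outcomes):
    """Return (reliability, resolution, uncertainty) for binary outcomes.

    Questions are grouped by their exact forecast value, so that
    reliability - resolution + uncertainty equals the Brier score.
    """
    n = len(outcomes)
    base_rate = sum(outcomes) / n

    groups = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)

    reliability = 0.0
    resolution = 0.0
    for f, obs in groups.items():
        observed_freq = sum(obs) / len(obs)
        reliability += len(obs) * (f - observed_freq) ** 2
        resolution += len(obs) * (observed_freq - base_rate) ** 2

    uncertainty = base_rate * (1 - base_rate)
    return reliability / n, resolution / n, uncertainty


# Hypothetical resolutions, for illustration only.
outcomes = [1, 0, 0, 1, 0]

# Always predicting 50%: a single group, so resolution is zero.
rel, res, unc = murphy_decomposition([0.5] * len(outcomes), outcomes)
print(rel, res, unc, rel - res + unc)
```

For the always-50% strategy the resolution term vanishes, so any resolution GPT-4 shows is discrimination the baseline cannot have. The random-percentage strategy produces almost all distinct forecast values, so it would need binning first (e.g. into deciles), at which point the identity only holds approximately.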
You could also check whether GPT-4 is statistically significantly better than chance.
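One simple way to do that check, sketched under some assumptions: treat the questions as independent, take each question's squared error minus 0.25 (the always-50% score), and test whether the mean of those differences is below zero. This uses a normal approximation, so it is only reasonable with a decent number of questions. The function name and the forecasts below are made up.

```python
import math
from statistics import NormalDist, mean, stdev


def compare_to_always_half(forecasts, outcomes):
    """One-sided test of whether the mean squared error is below 0.25.

    Returns (mean difference, z statistic, approximate one-sided p-value).
    Assumes independent questions and a normal approximation.
    """
    diffs = [(f - o) ** 2 - 0.25 for f, o in zip(forecasts, outcomes)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all per-question differences are identical")
    z = mean(diffs) / (sd / math.sqrt(len(diffs)))
    # A small p-value means the forecaster beats always predicting 50%.
    p_value = NormalDist().cdf(z)
    return mean(diffs), z, p_value


# Hypothetical forecasts and resolutions, for illustration only.
forecasts = [0.9, 0.2, 0.8, 0.1, 0.7, 0.3, 0.9, 0.2]
outcomes = [1, 0, 1, 0, 1, 0, 1, 0]

mean_diff, z, p_value = compare_to_always_half(forecasts, outcomes)
print(mean_diff, z, p_value)
```

If the questions are correlated (e.g. several about the same event), the independence assumption breaks and the p-value will look better than it should.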