Is that table representative of the data? If so, it is a very poor dataset. Most of those questions look very in-group, so forecasting 0.5 on them is the accurate answer: anyone outside that bubble has no idea what the outcome will be.
I wonder how different the results are if you filter out every question that contains a first-person pronoun or that mentions anyone who was not Wikipedia-notable as of the cutoff date.
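A rough sketch of the pronoun half of that filter, assuming the questions are available as plain strings. The pronoun list, the function names, and the sample questions are all made up for illustration. The Wikipedia-notability check is not attempted here, since it would need an external lookup.

```python
import re

# Hypothetical pronoun list (an assumption); extend as needed.
FIRST_PERSON = re.compile(
    r"\b(i|me|my|mine|myself|we|us|our|ours|ourselves)\b",
    re.IGNORECASE,
)

def has_first_person(question: str) -> bool:
    """Return True if the question contains a first-person pronoun.

    The all-caps token "US" is skipped so that questions about the
    United States are not dropped by the case-insensitive match.
    """
    return any(m.group(0) != "US" for m in FIRST_PERSON.finditer(question))

def filter_questions(questions):
    """Keep only the questions with no first-person pronoun."""
    return [q for q in questions if not has_first_person(q)]

# Made-up example questions, not taken from the dataset.
questions = [
    "Will I finish my thesis by June?",
    "Will the incumbent win the 2024 election?",
    "Will we hit 100 subscribers this month?",
    "Will the US enter a recession this year?",
]
kept = filter_questions(questions)
```

Comparing the scores on `kept` against the scores on the full set would show how much of the result depends on the in-group questions.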
Perhaps it does well in politics and sports because those are the only general-knowledge categories with a decent number of questions to evaluate (per the y-scale in the per-category graphs). Finance appears to contradict that, though, since it has a similar number of questions and a similar uncertainty score.
It appears you only show uncertainty relative to the model's own predictions, and not whether the Manifold data showed the question to be uncertain even to Manifold users.
I also would’ve expected to see some evidence of that being a good prompt, rather than leaving it open whether the entire outcome is an artifact of the prompt given.