I got 73.
I didn’t find this test as good as the other one:
1) In the estimating test, you have to figure out things in a void, with no clue from the question. But in this test, if the question is whether Sarah Blogg was Humphrey Bogart’s second wife, my estimate goes from 0.00001% to 50%. So I often find myself guessing whether it’s a trick question.
2) The results don’t seem to take accuracy into account, meaning you might get a perfect score by answering “50%” on every question (I haven’t tried). A log scoring system would be better. (But then, I didn’t dig much for their formula.)
3) Their graph is ugly. The vertical gridlines don’t line up with the numbers at the bottom! Geez!
1) I like having at least some data; I still found myself using all 10 options at least once. That is, the test still relied to a large extent on my prior knowledge.
2) You’re right about this. I tried it, and the results don’t take accuracy into account; guessing 50% every time got me a perfect score. I don’t know enough about designing these things to make one with a log scoring rule, but it would definitely be nice to see one.
3) Ooh, that is weird. The gridlines don’t seem to mean as much as the actual numbered labels; taking them off would make this go away.
It seems like neither of these tests is able to measure both calibration and discrimination.
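For anyone curious what a log scoring rule would change, here is a minimal sketch in Python. It is not the formula either test uses (nobody in this thread found that); the function names, the ten sample answers, and the simple "calibration gap" measure are all made up for illustration. It compares someone who answers 50% on everything with someone who actually knows something and is also well calibrated.

```python
import math

def log_score(forecasts, outcomes):
    """Mean natural log of the probability assigned to what actually happened.

    Higher (closer to 0) is better. Assumes no forecast is exactly 0 or 1,
    since math.log(0) would raise an error.
    """
    total = 0.0
    for p, happened in zip(forecasts, outcomes):
        total += math.log(p if happened else 1.0 - p)
    return total / len(forecasts)

def calibration_gap(forecasts, outcomes):
    """Weighted mean of |stated probability - observed frequency|,
    grouping questions by the exact probability given. 0 means perfectly calibrated.
    """
    groups = {}
    for p, happened in zip(forecasts, outcomes):
        groups.setdefault(p, []).append(1 if happened else 0)
    n = len(forecasts)
    return sum(len(hits) / n * abs(p - sum(hits) / len(hits))
               for p, hits in groups.items())

# Hypothetical answer key: 5 of the 10 statements are true.
outcomes = [True, True, True, True, False, False, False, False, False, True]

always_half = [0.5] * 10             # answers "50%" to everything
informed = [0.8] * 5 + [0.2] * 5     # right 4 times out of 5 in each group

for name, forecasts in [("always 50%", always_half), ("informed", informed)]:
    print(name,
          "calibration gap:", round(calibration_gap(forecasts, outcomes), 3),
          "log score:", round(log_score(forecasts, outcomes), 3))
```

Both answerers come out perfectly calibrated on this made-up data, so a calibration-only score cannot tell them apart. The log score can: the informed answerer gets a mean log score nearer to zero than the 50% guesser, because it rewards discrimination as well as calibration.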