1) I like having at least some data; I still found myself using all 10 options at least once. That is, the test still relied to a large extent on my prior knowledge.
2) You’re right about this. I tried and they don’t; guessing 50% every time got me a perfect score. I don’t know enough about designing these things to make one with a log scoring rule, but it would definitely be nice to see one.
3) Ooh, that is weird. The gridlines don’t seem to mean as much as the actual numbered labels; taking them off would make this go away.
It seems like neither of these tests is able to measure both calibration and discrimination.
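
To make point 2 concrete, here is a rough sketch with toy numbers of my own (not taken from either test). It assumes true/false questions and a hypothetical 20-question quiz; `log_score` and `calibration_gap` are names I made up for illustration. Both forecasters come out perfectly calibrated, but only the log score rewards the one who can actually tell true statements from false ones.

```python
import math

def log_score(probs, outcomes):
    """Mean natural-log score: 0 is best, more negative is worse."""
    total = sum(math.log(p if o else 1 - p) for p, o in zip(probs, outcomes))
    return total / len(probs)

def calibration_gap(probs, outcomes):
    """Weighted mean |stated probability - observed frequency|,
    grouping answers by the probability that was stated."""
    groups = {}
    for p, o in zip(probs, outcomes):
        groups.setdefault(p, []).append(o)
    gap = sum(len(v) * abs(p - sum(v) / len(v)) for p, v in groups.items())
    return gap / len(probs)

# Hypothetical quiz: 20 statements, 10 of them true (1) and 10 false (0).
outcomes = [1] * 9 + [0] + [0] * 9 + [1]

# Forecaster A says 50% on everything.
always_half = [0.5] * 20
# Forecaster B says 90% on the first ten (9 are true) and 10% on the rest (1 is true).
informed = [0.9] * 10 + [0.1] * 10

print(calibration_gap(always_half, outcomes))  # 0.0: half of the "50%" answers are true
print(calibration_gap(informed, outcomes))     # 0.0: also perfectly calibrated
print(round(log_score(always_half, outcomes), 3))  # -0.693
print(round(log_score(informed, outcomes), 3))     # -0.325, i.e. better
```

So a test that only checks calibration can't separate these two, while a log scoring rule does.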