This is, in fact, close to being the worst system ever devised. The fact that something is widely used does not mean that it is any good. Examining the results of this kind of system shows that, when applied to unfamiliar material, it consistently gives the best marks to the worst students. If the best students can’t do every problem with extreme ease, they tend to venture answers where poor students do not. This results in the best students dropping towards the median score and the highest scores going to poor students who were lucky. Applying the system to familiar material should produce a similar, though less pronounced, effect. Adding penalties lowers the dispersion about the mean, which always makes an exam less useful.
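To make the mechanics concrete, here is a toy Monte Carlo sketch of negative marking. The 4-choice format, the −1/3 penalty, and the 70%/30% accuracy figures are illustrative assumptions, not numbers from the argument above:

```python
import random
import statistics

# Toy simulation of penalty ("negative marking") scoring.
# Assumed setup: 4-choice questions with the classic -1/3 penalty,
# so a blind guess has expected value 0.
K = 4
PENALTY = -1.0 / (K - 1)
N_QUESTIONS = 50
N_STUDENTS = 10_000

def sit_exam(p_correct: float) -> float:
    """Total score for a student who answers every question and gets
    each one right with probability p_correct."""
    return sum(
        1.0 if random.random() < p_correct else PENALTY
        for _ in range(N_QUESTIONS)
    )

strong = [sit_exam(0.70) for _ in range(N_STUDENTS)]  # strong students
weak = [sit_exam(0.30) for _ in range(N_STUDENTS)]    # weak students

print("strong: mean", statistics.mean(strong), "min", min(strong))
print("weak:   mean", statistics.mean(weak), "max", max(weak))
# The tails overlap: the luckiest weak students typically outscore
# the unluckiest strong ones, even though the means are far apart.
```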
Exam systems that have no penalty for wrong answers are better than ones that do, but are still imperfect. The only reliable way to gauge students’ ability is to set far more questions (preferably spread over several papers), both to reduce the effect of mistakes relative to ignorance and to increase the number of areas examined. This is generally cost-prohibitive. It also tests students’ ability to answer exam questions rather than their understanding. There is, fortunately, a way to test understanding: a student understands material when they can rediscover the ideas that draw on it.
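A quick way to see the statistical point: the noise in an exam score shrinks roughly like 1/√n as the number of questions n grows. A minimal sketch, assuming a no-penalty format and an illustrative 70%-per-question ability:

```python
import random
import statistics

# Why more questions give a more reliable measurement:
# the spread of the observed score around true ability
# shrinks roughly like 1/sqrt(n).
def observed_score(ability: float, n_questions: int) -> float:
    """Fraction of questions answered correctly on one exam."""
    correct = sum(random.random() < ability for _ in range(n_questions))
    return correct / n_questions

ABILITY = 0.7
for n in (10, 40, 160, 640):
    scores = [observed_score(ABILITY, n) for _ in range(5_000)]
    print(f"n={n:4d}  sd={statistics.stdev(scores):.3f}")
```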
This is, in fact, close to being the worst system ever devised.
Not really: it teaches calibration as well as correctness. Are you more than 50% sure? No? Then don’t guess.
In fact, it shares several properties with the best system ever devised (for multiple-choice questions, at least): the test-taker assigns a probability to each of the answers (with the probabilities summing to one) and is graded on the logarithm of the probability they assigned to the correct answer. (Typically there’s an offset so that assigning equal probability to all possibilities gives a score of 0, making it possible to earn positive points.)
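A minimal sketch of that scoring rule, with the offset convention just described (the base-2 logarithm is an arbitrary choice; the description above doesn’t fix a base):

```python
import math

# Log scoring rule with the stated offset: assigning equal
# probability to every option scores exactly 0, so the log is
# taken relative to the uniform guess 1/n.
def log_score(probs: list[float], correct_index: int) -> float:
    """Score = log2(p_correct) - log2(1/n), in bits.
    Positive iff you beat a uniform guess; probs must sum to 1."""
    assert abs(sum(probs) - 1.0) < 1e-9, "probabilities must sum to 1"
    n = len(probs)
    return math.log2(probs[correct_index]) - math.log2(1.0 / n)

# Uniform guess on a 4-option question: exactly 0 points.
print(log_score([0.25, 0.25, 0.25, 0.25], 0))  # 0.0
# Well-placed 70% confidence: positive score.
print(log_score([0.70, 0.10, 0.10, 0.10], 0))  # ~ +1.49
# Overconfident and wrong: heavily negative.
print(log_score([0.01, 0.33, 0.33, 0.33], 0))  # ~ -4.64
```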
Examining the results of this kind of system shows that, when applied to unfamiliar material, it consistently gives the best marks to the worst students.
Do you have linkable results? My experience with probability log-scoring is that, even on the first test, the median score is somewhat better than 0 and there are several negative scorers, but the test-takers who receive the best marks (those who are both high-accuracy and high-calibration) stand out noticeably from the pack, and they are hardly the worst students.
The worst marks often go to students whose accuracy is high but whose calibration is low, and that effect goes away once they learn calibration, which seems like a feature, not a bug.
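To illustrate the high-accuracy, low-calibration failure mode, here’s a sketch with assumed numbers (4-option questions, a student who picks the right answer 80% of the time and always assigns the same stated probability to their pick, splitting the remainder evenly):

```python
import math

# Expected per-question log score (in bits, relative to a uniform
# guess) for a student of fixed accuracy and fixed stated confidence.
def expected_score(accuracy: float, stated_p: float, n_options: int = 4) -> float:
    p_rest = (1.0 - stated_p) / (n_options - 1)
    when_right = math.log2(stated_p * n_options)
    when_wrong = math.log2(p_rest * n_options)
    return accuracy * when_right + (1.0 - accuracy) * when_wrong

print(expected_score(0.80, 0.80))    # well calibrated:      ~ +0.96
print(expected_score(0.80, 0.99))    # overconfident:        ~ +0.34
print(expected_score(0.80, 0.9999))  # wildly overconfident: ~ -0.97
```

The log rule is a strictly proper scoring rule, so the expected score above is maximized exactly when the stated probability equals the true accuracy; overconfidence is what drags high-accuracy students down, and the damage disappears once they calibrate.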
If the best students can’t do every problem with extreme ease, they tend to venture answers where poor students do not. This results in the best students dropping towards the median score and the highest scores going to poor students who were lucky.
How can poor students get lucky if they don’t venture answers to questions where they are not sure?
The only reliable way to gauge students’ ability is to set far more questions (preferably spread over several papers), both to reduce the effect of mistakes relative to ignorance and to increase the number of areas examined.
The trouble with this approach is that you are then also grading speed and resistance to mental fatigue. In some cases that is desirable; in others, not.