It seems to be conventional wisdom that tests are generally gameable, in the sense that an effective (the most effective?) way to produce the best scores is to teach password-guessing rather than to teach the material deeply, i.e., such that the student can use it in novel and useful ways. Indeed, I think this is the case for many (most, even) tests, but I also think it possible to write tests that are most easily passed by learning the material deeply. In particular, I don’t see how to game questions like “state, prove, and provide an intuitive justification for Pascal’s combinatorial identity” or “under what conditions does f(x) = ax^3 + bx^2 + cx + d have only one critical point?”, but that’s more a statement about my mind than about the gameability of tests. I would greatly appreciate learning how a test consisting of such questions could be gamed, thereby unlearning an untrue thing; and if no one here can (or, at least, is willing to take the time to) explain how such a thing could be done, well, that’s useful to know, too.
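(For concreteness, a sketch of the kind of answer the second question is after, assuming a ≠ 0 so that f is genuinely cubic: f'(x) = 3ax^2 + 2bx + c is a quadratic, so the number of critical points of f is the number of real roots of f', which is governed by the discriminant (2b)^2 - 4(3a)c = 4(b^2 - 3ac); f therefore has exactly one critical point precisely when b^2 = 3ac. Presumably a question like this is passed by producing that chain of reasoning rather than by recalling a memorized answer.)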
One easy way I can think of to game such a test is to figure out ahead of time which questions will be on the test, look up answers to just those questions, and parrot them on the actual test.
I know that at my college, there were databases of just about every professor’s exams from the past several years. Most professors re-used enough questions that you could get a pretty good idea of what would be on an exam just by looking at past ones. Many people spent a great deal of time studying old exams to game this process instead of actually learning the material.
How do you identify people who can grade answers to questions which show deep understanding?
If we assume that the questions are designed such that a student can answer them upon initial exposure if and only if they deeply understand the material, then the question of identifying graders turns into the much easier question of identifying people who can discriminate between valid and invalid answers. I’m told that being able to discriminate between valid and invalid responses is a necessary condition for subject expertise, so anyone who is a relevant expert works. One way to demonstrate expertise is to build something that requires expertise. As an extreme example, I’m confident that Grigori Perelman understands topology because he proved the Poincaré conjecture, and, for similar reasons, I’m (mostly) confident that Ph.D.s are experts. If we have well-designed tests, we can define the set of people qualified to grade them as “has built something requiring expertise, or has passed a well-designed test graded by someone already in this set.”
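That last clause makes this an inductive, fixed-point style definition. Here is a minimal sketch of it in Python, just to make the bootstrapping explicit; the predicate names are hypothetical and nothing below comes from the original comment:

```python
# Minimal sketch (hypothetical names) of the inductive "qualified grader" set:
# base case      = has built something requiring expertise;
# inductive step = has passed a well-designed test graded by someone already qualified.

def qualified_graders(candidates, built_expert_artifact, passed_test_graded_by):
    """Close the base set under 'passed a test graded by a qualified grader'."""
    qualified = {c for c in candidates if built_expert_artifact(c)}
    changed = True
    while changed:
        changed = False
        for c in candidates:
            if c not in qualified and any(
                passed_test_graded_by(c, grader) for grader in qualified
            ):
                qualified.add(c)
                changed = True
    return qualified
```

The base case seeds the set with people like Perelman; the loop then repeatedly admits anyone vouched for, via a well-designed test, by someone already in the set, until nothing changes.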
More importantly, how do you persuade these people that they should spend their (presumably, valuable) time grading mostly stupid answers?
How would that be different from grading things under the current system?
Maybe exams should be layered: first, a test of understanding of basic terminology; then a lottery in which students comment on scientific research (1 ticket = 1 article).
Under the current system, grading is (relatively) easy because all you need to check is whether the answer given matches the correct one.
The proposed system would involve answers that cannot be mechanically pattern-matched and would need a LOT more time and effort to grade.
You pay them. You also tell them that their job is to identify good answers, not to give detailed feedback to bad answers.
If your goal is to foster understanding instead of giving canned answers, this seems counterproductive.
If you’re after feedback-for-understanding, providing a student with a list of questions they got wrong and a good solutions manual (which you only have to write once) works most of the time (my guess is around 90% of the time, but I have low confidence in my estimate, because I’m capable of successfully working through entire textbooks’ worth of material without needing human feedback, which I’m told is not often the case). Doing this should be more effective than having the error explained outright, à la the generation effect.
Another interesting result is that the best feedback for fostering understanding often comes not from experts, who have such a deep degree of understanding and automaticity that it impairs their ability to simulate and communicate with minds struggling with new material, but from students who just learned the material. There’s a risk that students who believe the right thing for the wrong reason will propagate their misunderstanding, but I think that pairing up a student who’s struggling with some concept (i.e., throwing a solutions manual at them hasn’t helped them bridge the conceptual gap that caused them to get the question wrong) with a student who understands it is often helpful. IIRC, Sal Khan described using this technique with some success in his book; a friend/mentor who teaches secondary math and keeps up with the literature tells me it works; and I’ve used this basic technique while running an enrichment afterschool program for the local Mathcounts team after the season had ended, and can only describe its efficacy as “definitely witchcraft”.
I think there’s a place for graders to give detailed feedback on bad answers, but most of the time it’s better to make students do the work themselves and locate their own errors and conceptual gaps, and in most of the remaining cases to pawn the responsibility off on other students (this could be construed as teachers being lazy, but it’s also what, to my knowledge, produces the best learning outcomes). Since detailed feedback is only desirable after two rounds of other approaches that (in my deeply nonrepresentative experience) usually work, I don’t think it makes sense to produce detailed feedback on every wrong answer.
Then again, I don’t fully understand what context you’re thinking in. In my original post, I was thinking about purely diagnostic math tests given to postsecondary students for employers that wouldn’t so much as tell students which questions they got wrong, along the lines of the Royal Statistical Society’s Graduate Diploma (five three-hour tests which grant a credential equivalent to a “good UK honours degree”). In writing this, I’m mostly imagining standardized math tests for secondary students in America (which, I’m given to understand, already have written components), which currently don’t give per-question feedback, but changing that is much less of a pipe dream than creating tests that effectively test understanding. Come to think of it, I think the above approach applies even better to classroom instructors giving their own tests, at either the secondary or postsecondary level.
Tangentially related: the best professor I ever had would type up 3–4 pages of general commentary for the class after every problem set and test, generally by the next class (common errors, why they were wrong, and how to do better, as well as things the class did well). I found this commentary extraordinarily helpful, not just because of the feedback itself, but because (a) it helped dispel the misperception that everyone else understood everything and I was struggling because I was stupid, (b) it taught us to discriminate between bad, mediocre, and good work, and (c) comments like “most of you did [x], which was suboptimal because of [y], but one of you did [z], which takes a bit more work but is a better approach because [~y]” really drove me not to do the minimum amount of work to get an answer when I could do a bit more work to get a stronger solution. (The course was in numerical methods, so, as an example, we once had a problem where we had to use some technique in which error explodes (I’ve now forgotten which, since I didn’t have Anki back then) to locate a typo in some numeric data. A sufficient answer would have been to identify the incorrect entry; a stronger answer was to identify the incorrect entry, figure out the nature of the error (two digits typed in the wrong order), and demonstrate that fixing it made the explosion go away.)
“Another interesting result is that the best feedback for fostering understanding often comes not from experts, who have such a deep degree of understanding and automaticity that it impairs their ability to simulate and communicate with minds struggling with new material, but from students who just learned the material”
The core material for teaching is not the subject to be taught, but human confusions about that subject.
“providing a student with a list of questions they got wrong and a good solutions manual … works most of the time … I’m capable of successfully working through entire textbooks’ worth of material without needing human feedback, which I’m told is not often the case”
That’s a very important point. My impression is that people can be divided into two general categories: those who learn best by themselves, and those who learn best when being taught by someone.
I suspect that most people on LW prefer to inhale textbooks on their own. I also suspect that most people outside of LW prefer to have a teacher guide them.
Fair point—I’d spaced out on this being for a class rather than an employer looking for clueful people.
Testing and credentialism are a mess. The basic problem is that it’s unclear what the result should measure: how much the student knows, how much the student has learned, how intelligent the student is, how conscientious they are, or how well the student’s capabilities line up with the topic. The secondary problem is that in most settings the test should be both hard to game AND perfectly objective, such that there is no argument about the correctness of an answer (and such that grading can be done quickly).
I spend a lot of time interviewing and training interviewers for tech jobs. This doesn’t have the first problem: we have a clear goal (determine whether the candidate is likely to perform well in the role, usually tested by having them solve problems similar to those they would face in the role). The second difficulty is similar: a good interview generates actual evidence of the candidate’s likely success, not just domain knowledge. This takes a lot of interviewing skill to get the best from the candidate, and a lot of judgement in how to evaluate the approach and weigh the various aspects tested. We put a lot of time into this, and accept the judgement aspect rather than trying to reduce the time spent, automate the results, or be purely objective in assessment.