The main problem is that a test tests ability to take the test, independently of what its makers intended. The more similar tests are to each other, the more taking the first is training for the second, and the easier it is to teach directly to the test rather than to the skill that inspired the test. The less similar the before and after tests are, the less comparable they are.
Rationality training is particularly tricky because one is to learn formal models of both straight and twisted thinking, recognize when real-life situations resemble those patterns, and then decide how much formal treatment to give the situation, as well as how much weight to give to one’s formal model as against one’s feelings, reflexive thoughts, and so on.
Traditional classroom tests are best set up to test the first bit, knowledge of the formal models, assuming one has solved the problems inherent in testing at all. Even to the extent one can ask people how one ought to react in the field, e.g. when to use which sort of calculation, that is still a question with a correct answer according to a formal model, and one is still not testing the ability to apply it!
These problems resemble those the military has faced in its training and testing. It uses indoctrination, simulations, and field tests. Decision making is tested under uncomfortable conditions, which makes good decision making under most circumstances more likely. In general, knowing what the military does is likely to be helpful.
The problems with tests are not intractable. One can limit the gain on the second test from having taken the first by saturating the test taker with knowledge of the test before it is taken the first time, though few test takers would be motivated to absorb it. One can try to make a test similar to the skill tested, so that ability at the test correlates well with the skill one intends to test. One can try to devise very different sorts of tests that measure the same thing (I doubt that will work here).
One component of a useful classroom test might resemble the classic research on correspondence bias. In it, people judge individuals’ support for positions based on an essay they supposedly wrote. Some subjects are told that the writer chose the thesis; others, that the writer had it assigned. (The theses were either pro- or anti-Castro.) People inferred that the essay’s author substantially agreed with the thesis even when told it had been assigned. The quality of an essay a person produces is some evidence of what they believe, as is their willingness to write it at all, etc., but in general people over-infer others’ dispositions from actions taken under social constraint, even when they know of the constraint.
Here is how the framework could translate into a useful rationality test: the test would give people some evidence for something they are biased to overly believe, and the quantity and quality of legitimate evidence in the test would vary widely. One would not be able to pass the test simply by detecting the bias and then declaring oneself unmoved in that wrong direction, as one might be able to do for, say, sunk costs. Instead, the valid evidence and the invalid inclination would point along the same vector, so that one would have to distinguish the bias from the rest of the evidence in the environment.
This solves the problem of having a classroom test be an easy exercise of spotting the biased thought pattern and quashing it. Videos or essays of various people with known beliefs arguing for or against those beliefs could be used to train and test people in this. It’s actually probably a skill one could learn without any idea of how one was doing it.
Expressed abstractly, the idea is to test for the ability to quantify wrong thinking by mixing it with legitimate evidence, all of which pushes confidence toward the same conclusion. This is hard to game because the hard part isn’t recognizing the bias. Because the material is media from real life, testers cannot impose an unrealistic model that ignores actual evidence (e.g., a strongly pro-Castro person really might refuse to write an anti-Castro essay).
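To make the abstract idea concrete, here is a minimal sketch of how such a test might be scored. It assumes (these are my assumptions, not anything from the original proposal) that each test item carries a prior, a set of likelihood ratios for the legitimate evidence, and a known direction the bias pushes in; a subject is then scored by distance from the Bayesian benchmark, plus a signed drift term in the bias direction. All names (`posterior`, `score_subject`, the item fields) are hypothetical.

```python
def posterior(prior, likelihood_ratios):
    """Bayesian benchmark: update prior odds by each legitimate likelihood ratio."""
    odds = prior / (1 - prior)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1 + odds)

def score_subject(items, answers):
    """Score a subject's stated credences against the benchmark.

    Returns mean squared calibration error and mean signed deviation in the
    bias direction (positive = over-influenced by the bias, negative =
    over-corrected, near zero = bias successfully separated from evidence).
    """
    errors, bias_drift = [], []
    for item, credence in zip(items, answers):
        bench = posterior(item["prior"], item["likelihood_ratios"])
        errors.append((credence - bench) ** 2)
        bias_drift.append((credence - bench) * item["bias_direction"])
    n = len(items)
    return {"calibration_error": sum(errors) / n,
            "bias_susceptibility": sum(bias_drift) / n}

# Example item: modest legitimate evidence (likelihood ratio 2) for a thesis
# the bias also pushes toward (bias_direction = +1).
items = [{"prior": 0.5, "likelihood_ratios": [2.0], "bias_direction": +1}]
result = score_subject(items, [0.8])  # subject reports 0.8; benchmark is 2/3
```

The point of the two separate scores is exactly the distinction in the proposal: a subject who simply declares themselves unmoved scores a large negative drift (over-correction), while one who swallows the bias whole scores a large positive drift; only someone who weighs the legitimate evidence correctly lands near zero on both.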
Ah, yes, that is indeed the first thing one should work on, otherwise the MW (Must Win) interpretation of Rationality is little better than the MW (Many Worlds) interpretation of Quantum Mechanics. I didn’t realize that, after all this time, there are still no objective metrics to measure the success of the course. I wish I had good ideas as to how to experimentally measure rationality, but alas. Hopefully other forum regulars do. Or maybe EY can spend some time thinking about it.
I guess an obvious way to start is to score a particular behavior based on some objective criteria, like the pass/fail on those sunk cost situations Anna (?) linked here some time ago. It’s not nearly as good as actually putting people into the circumstances where they have to apply their newly learned skills (such as detecting confusion, recognizing cognitive dissonance, what have you), but it’s a start.
As a next step, my guess is that if you look through the standard psychological experiments (maybe something less drastic and notorious than the Stanford prison experiment), you will find quite a number of them that can be cheaply replicated in a controlled setting like a mini-camp. I’m sure that gwern can dig up a whole whack of them in no time flat. Or maybe you are already doing this, for all I know. The important thing is that the participants should be inside the situations, not outside of them, and hopefully unaware that they are being tested. I guess it is sort of similar to giving two sets of CRTs, before and after.
That’s not something to ask people, that’s something you ought to actually measure before and after, otherwise what kind of rationalists are you.
Would you like to help us develop our rationality metrics? It’s a fairly difficult problem. We can’t just give people the CRT before and after a camp.