I seem to be years late to this party, but I’ve heard the LW culture isn’t opposed to commenting on old posts. In the interest of “breadth” I’ll answer anyway after at least five minutes of thought, without looking at the other answers first (though I’ve probably seen subsequent posts that have been influenced by this one by now).
So there are three categories of tests here, in order of strictness: those for masters, those for students, and those for employees?
There are many skills under the “rationality” umbrella. Enumerate them and test separately. Maybe there are some we don’t know yet. How do we test for those? There’s also a difference between epistemic and instrumental rationality. Epistemic seems easier to test and is probably required for instrumental. But instrumental is what we really want. Some of my test suggestions will only test a part of “rationality”.
Schools and science have a lot of experience measuring things like this. Can we learn from them?
Every test I’ve come up with seems to be in one of two categories: toy problems, or real-life problems. The real-life problems are perhaps better for the masters, and the toy problems for the students. The toy problems are less real, but more replicable. I thought we were supposed to hold off on proposing solutions, to avoid attractors like premature categorization limiting our scope. But we’ve been asked to brainstorm. Can we break out of these categories?
Some Ideas:
Give the students a sum to invest in a small business and a time limit, then see how much they make. Require strict record keeping to prevent cheating. Noisy.
Give them a sum to invest in a prediction market, then see how much they make.
Use more direct calibration tests. Make students give probabilities for things. See how often they’re right.
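To make the grading concrete, here’s a minimal sketch (Python, with made-up answers) of how stated probabilities might be scored; the data and the choice of a Brier score plus a bucket-by-stated-probability breakdown are just my assumptions, not a fixed procedure:

```python
from collections import defaultdict

# Minimal sketch of scoring a calibration test.
# Each answer: (stated probability that the claim is true, whether it was actually true).
answers = [(0.9, True), (0.6, False), (0.8, True), (0.3, False), (0.7, True)]

# Brier score: mean squared error between stated probability and outcome (lower is better).
brier = sum((p - (1.0 if correct else 0.0)) ** 2 for p, correct in answers) / len(answers)
print(f"Brier score: {brier:.3f}")

# Calibration table: within each stated-probability bucket, how often was the student right?
buckets = defaultdict(list)
for p, correct in answers:
    buckets[round(p, 1)].append(correct)
for p in sorted(buckets):
    hits = buckets[p]
    print(f"stated {p:.1f}: right {sum(hits)}/{len(hits)} of the time")
```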
A student must catch specific examples of cognitive errors/fallacies in a video. (Arguably the important part is to catch one’s own errors, and the ability to find others’ errors doesn’t prove that.)
Make a student write an essay before the term. The instructor will find examples of cognitive errors in it, but keep them secret. Then, after the term, the student must review his essay and find as many errors in his former thinking as possible. This measures personal improvement, but might not help with comparisons between peers, since they’re all taking different “tests”.
SAT-style multiple-choice exam. This can test knowledge of the material, and synthesis too (or so the test writers claim) to a limited extent.
Like the three integers test, the master can play the role of nature while the students play the role of scientists, trying to figure out a simple rule by “experiment”. Grading can be on the number of questions asked, the time taken, the difficulty of the rule, or the number of these questions answered correctly. The instructor must be strictly forbidden from giving hints that could ruin the results. This is actually very similar to debugging software. Maybe this kind of test could be computerized, with “nature” as an opaque program and students writing code that interacts with it as their “experiments”. They may then have to write code that emulates the rule. If it passes the unit tests, a human instructor can confirm whether it implements the same rule. This could also give students a feel for what it’s like to do science correctly.
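A rough sketch of what the computerized version could look like; the example rule, the function names, and the grading signals are purely illustrative assumptions:

```python
import random

# "Nature": an opaque rule the students may only probe by querying it.
# (Example rule, hidden from the student: a triple is accepted iff it is strictly increasing.)
def _hidden_rule(seq):
    return all(a < b for a, b in zip(seq, seq[1:]))

experiments_run = 0

def experiment(seq):
    """The only interface the student gets: submit a sequence, learn yes/no."""
    global experiments_run
    experiments_run += 1
    return _hidden_rule(seq)

# A student's "experiments" might look like this:
print(experiment([2, 4, 6]))   # True  -- consistent with many rules
print(experiment([6, 4, 2]))   # False -- rules out "any three even numbers"
print(experiment([1, 2, 2]))   # False -- rules out "non-decreasing"

# The student's final guess, graded by agreement with the hidden rule on random cases
# (then confirmed by a human instructor, since passing tests isn't proof of equivalence):
def student_rule(seq):
    return all(a < b for a, b in zip(seq, seq[1:]))

cases = [random.sample(range(100), 3) for _ in range(200)]
agreement = sum(student_rule(c) == _hidden_rule(c) for c in cases)
print(f"Agrees with the hidden rule on {agreement}/{len(cases)} random cases, "
      f"after {experiments_run} experiments.")
```

Grading could then combine the experiment count with agreement on the held-out cases, though the exact weighting is up to the instructor.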
A competition: students program AIs to win at game-theory-inspired challenges. See how they compare to well-known strategies. It may be hard to keep the challenges secret. Could the payoff grid be randomized?
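Purely as an illustration of the format (an iterated prisoner’s dilemma with the payoff matrix redrawn each tournament, and the student’s bot pitted against well-known strategies like tit-for-tat; the bot itself and all the names here are made-up examples):

```python
import random

# Payoff matrix for (my_move, their_move); randomized per tournament so the
# challenge can't be memorized, while keeping the usual ordering T > R > P > S.
def random_payoffs():
    t, r, p, s = sorted(random.sample(range(1, 20), 4), reverse=True)
    return {("D", "C"): t, ("C", "C"): r, ("D", "D"): p, ("C", "D"): s}

# Well-known baseline strategies.
def tit_for_tat(my_hist, their_hist):
    return their_hist[-1] if their_hist else "C"

def always_defect(my_hist, their_hist):
    return "D"

# A hypothetical student submission.
def student_bot(my_hist, their_hist):
    # Defect only if the opponent defected twice in a row.
    return "D" if their_hist[-2:] == ["D", "D"] else "C"

def play(bot_a, bot_b, payoffs, rounds=100):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        a, b = bot_a(hist_a, hist_b), bot_b(hist_b, hist_a)
        score_a += payoffs[(a, b)]
        score_b += payoffs[(b, a)]
        hist_a.append(a)
        hist_b.append(b)
    return score_a, score_b

payoffs = random_payoffs()
for name, rival in [("tit_for_tat", tit_for_tat), ("always_defect", always_defect)]:
    student, opponent = play(student_bot, rival, payoffs)
    print(f"vs {name}: student {student}, opponent {opponent}")
```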
Life outcome survey over years. Are they “winning” more than a control group? “Winning” may be hard to define. Slow. We should do this, but we shouldn’t wait for it before developing the program.
Masters can actually try to accomplish something. Maybe improve life outcomes in a third-world country or something. To be meaningful, it would need a control group, competitors, a time limit, and a budget.