My impression of where this would lead is something like this:
While an enormous amount of work has been done globally to develop and employ epistemic aids, relatively little study has gone into which epistemic interventions are most useful for specific problems.
We can envision an analog to the medical system. Instead of diagnosing physical sickness, it diagnoses epistemic illness and prescribes solutions on the basis of evidence.
We can also envision two wings of this hypothetical system. One is a “public epistemic health” wing, which studies mass interventions. The other is patient-centered epistemic medicine, which focuses on the problems of individual people or teams.
“Effective epistemics” is the attempt to move toward mechanistic theories of epistemology that are equivalent in explanatory power to the germ theory of disease. Whether such mechanistic theories can be found remains to be seen. But there was also a time during which medical research was forced to proceed without a germ theory of disease. We’d never have gotten medicine to the point where it is today if early scientists had said “we don’t know what causes disease, so what’s the point in studying it?”
So if we have a reasonable expectation that formal study would uncover mechanisms of equivalent explanatory power, that study would be a good use of resources, considering the extreme importance of correct decision-making for every problem humanity confronts.
Is this a good way to look at what you’re trying to do?
Kudos for the thinking here, I like the take.
There’s a whole lot to “making people more correct about things.” I’m personally a lot less focused on making sure the “masses” believe things we already know than on improving the epistemic abilities of the “best” groups. From where I’m standing, I imagine even the “best” people have a lot of room to improve. I personally barely feel confident about a bunch of things and am looking for solutions that would let me be more confident. More “super intense next-level prediction markets,” less “fighting conspiracy theories.”
I do find the topic of the epistemics of “the masses” interesting; it’s just a different problem. CSER did some work in this area, and I also liked the podcast about Taiwan’s approach to it (treating lies with epidemic models, similar to the analogy you mention).
I have an idea along these lines: adversarial question-asking.
I have a big concern about various forms of forecasting calibration.
Each forecasting team establishes its reputation by showing that its predictions, in aggregate, are well-calibrated and accurate on average.
However, questions are typically posed by a questioner who’s part of the forecasting team. This creates an opportunity for them to ask a lot of softball questions that are easy for an informed forecaster to answer correctly, or at least to calibrate their confidence on.
By advertising their overall level of calibration and average accuracy, they can “dilute away” inaccuracies on hard problems that other people really care about. They gain a reputation for accuracy, yet somehow don’t seem so accurate when we pose a truly high-stakes question to them.
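To make the dilution concrete, here is a toy calculation (the question counts, probabilities, and outcomes are entirely made up for illustration): a forecaster who is nearly flawless on softballs but barely better than a coin flip on hard questions can still advertise an excellent aggregate Brier score.

```python
# Toy illustration with made-up numbers: 90 softball questions, 10 hard ones.

def brier(prob, outcome):
    """Brier score for a single binary forecast (lower is better)."""
    return (prob - outcome) ** 2

# Softballs: forecast 95% and be right every time.
easy = [brier(0.95, 1) for _ in range(90)]

# Hard, high-stakes questions: forecast 80% but be right only half the time.
hard = [brier(0.80, 1) for _ in range(5)] + [brier(0.80, 0) for _ in range(5)]

print(f"Brier on easy questions: {sum(easy) / len(easy):.3f}")   # ~0.003
print(f"Brier on hard questions: {sum(hard) / len(hard):.3f}")   # 0.340
print(f"Advertised aggregate:    {sum(easy + hard) / 100:.3f}")  # ~0.036
```

The aggregate looks superb even though performance on the questions people actually care about is mediocre, which is exactly the dilution worry.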
This problem could be at least partly solved by having an external, adversarial question-asker. Even better would be some sort of mechanical system for generating the questions that forecasters must answer.
For example, imagine that you had a way to extract every objectively answerable question posed by the New York Times in 2021.
Currently, their headline article is “Duty or Party? For Republicans, a Test of Whether to Enable Trump.”
Though it does not state this in so many words, one of the primary questions it raises is whether the Michigan board that certifies vote results will certify Biden’s victory ahead of the Electoral College vote on Dec. 14.
Imagine that one team’s job was to extract such questions from a newspaper. Then they randomly selected a certain number of them each day, and posed them to a team of forecasters.
In this way, the work of superforecasters would be chained to the concerns of the public, rather than spent on questions that may or may not be “hackable.”
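Here is a minimal sketch of how the mechanical version might work; everything in it (the function names, the `extract_questions` step) is a hypothetical placeholder rather than a description of any existing tool.

```python
import random

def extract_questions(articles):
    """Hypothetical step: turn a day's articles into objectively resolvable
    questions. In practice this would be a human question-writing team or,
    possibly, an NLP model."""
    raise NotImplementedError  # placeholder

def daily_question_draw(articles, n=5, seed=None):
    """Randomly sample n extracted questions to hand to the forecasting team,
    so the question set tracks public concerns rather than forecaster choices."""
    pool = extract_questions(articles)
    rng = random.Random(seed)
    return rng.sample(pool, k=min(n, len(pool)))

# Usage sketch (fetch_articles is also hypothetical):
#   questions = daily_question_draw(fetch_articles(today), n=5)
```

The key property is that the forecasters never choose, reword, or veto the questions; the random draw from an externally generated pool is what keeps the question set honest.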
To me, this is a critically important and, to my knowledge, totally unexplored question that I would very much like to see treated.
Comparing groups of forecasters who worked on different question sets, using only simple accuracy measures like Brier scores, is basically not feasible. You’re right that forecasters can prioritize easier questions and do other hacks.
This post goes into detail on several incentive problems:
https://forum.effectivealtruism.org/posts/ztmBA8v6KvGChxw92/incentive-problems-with-current-forecasting-competitions
I don’t get the impression that platforms like Metaculus or GJP bias their questions much to achieve better Brier scores. This is one reason they typically focus more on their calibration graphs and on direct question comparisons between platforms.
All that said, I definitely think we have a lot of room to get better at comparing forecasting performance across platforms.
I’m less interested in comparing groups of forecasters with each other based on Brier scores than in getting a referendum on forecasting generally.
The forecasting industry has a collective interest in maintaining its reputation for predictive accuracy on general questions. I want to know whether forecasters are in fact accurate on general questions, or whether some of their apparent success rests on cunning choices about which questions they address.