As Cowen and Hanson put it, “Merely knowing someone else’s opinion provides a powerful summary of everything that person knows, powerful enough to eliminate any differences of opinion due to differing information.” So sharing evidence the normal way shouldn’t be necessary.
This is one of the loonier[1] ideas to be found on Overcoming Bias (and that’s quite saying something). Exercise for the reader: test this idea that sharing opinions screens off the usefulness of sharing evidence with the following real-world scenario. I have participated in this scenario several times and know what the correct answer is.
You are on the programme committee of a forthcoming conference, which is meeting to decide which of the submitted papers to accept. Each paper has been refereed by several people, each of whom has given a summary opinion (definite accept, weak accept, weak reject, or definite reject) and supporting evidence for the opinion.
To transact business most efficiently, some papers are judged solely on the summary opinions. Every paper rated a definite accept by every referee for that paper is accepted without further discussion, because if three independent experts all think it’s excellent, it probably is, and further discussion is unlikely to change that decision. Similarly, every paper firmly rejected by every referee is rejected. For papers that get a uniformly mediocre rating, the committee have to make some judgement about where to draw the line between filling out the programme and maintaining a high standard.
That leaves a fourth class: papers where the referees disagree sharply. Here is a paper where three referees say definitely accept, one says definitely reject. On another paper, it’s the reverse. Another, two each way.
How should the committee decide on these papers? By combining the opinions only, or by reading the supporting evidence?
ETA: [1] By which I mean not “so crazy it must be wrong” but “so wrong it’s crazy”.
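A minimal sketch of the triage rule just described, for concreteness: the four rating labels come from the scenario, while the function name and returned strings are only illustrative.

```python
# Sketch of the committee's triage rule described above. The four rating
# labels come from the scenario; the function name and returned strings
# are illustrative only.
def triage(ratings):
    """ratings: the summary opinions of all referees for one paper."""
    if all(r == "definite accept" for r in ratings):
        return "accept without further discussion"
    if all(r == "definite reject" for r in ratings):
        return "reject without further discussion"
    if all(r in ("weak accept", "weak reject") for r in ratings):
        return "uniformly mediocre: judge where to draw the line"
    # Everything else, including the sharply split cases discussed below,
    # goes to discussion.
    return "split verdict: examine the supporting evidence"

print(triage(["definite accept"] * 3))
print(triage(["definite accept"] * 3 + ["definite reject"]))
```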
This is one of the loonier[1] ideas to be found on Overcoming Bias (and that’s quite saying something). Exercise for the reader: test this idea that sharing opinions screens off the usefulness of sharing evidence with the following real-world scenario. I have participated in this scenario several times and know what the correct answer is.
Verbal abuse is not a productive response to the results of an abstract model. Extended imaginary scenarios are not a productive response either. Neither explains why the proofs are wrong or inapplicable, or if inapplicable, why they do not serve useful intellectual purposes such as proving some other claim by contradiction or serving as an ideal to aspire to. Please try to do better.
Your real world scenario tells you that sometimes sharing evidence will move judgements in the right direction.
Thinking that Robin Hanson or someone else on Overcoming Bias hasn’t thought of that argument is naive. Robin Hanson might sometimes make arguments that are wrong, but he’s not stupid. If you are treating him as if he were, then you are likely arguing against a strawman.
Apart from that, your example also has strange properties, like only four different kinds of judgement that reviewers are allowed to make. Why would anyone choose four?
Your real world scenario tells you that sometimes sharing evidence will move judgements in the right direction.
It is a lot more than “sometimes”. In my experience (mainly in computing) no journal editor or conference chair will accept a referee’s report that provides nothing but an overall rating of the paper. The rubric for the referees often explicitly states that. Where ratings of the same paper differ substantially among referees, the reasons for those differing judgements are examined.
Apart from that, your example also has strange properties, like only four different kinds of judgement that reviewers are allowed to make. Why would anyone choose four?
The routine varies but that one is typical. A four-point scale (sometimes with a fifth not on the same dimension: “not relevant to this conference”, which trumps the scalar rating). Sometimes they ask for different aspects to be rated separately (originality, significance, presentation, etc.). Plus, of course, the rationale for the verdict, without which the verdict will not be considered and someone else will be found to referee the paper properly.
Anyone is of course welcome to argue that they’re all doing it wrong, or to found a journal where publication is decided by simple voting rounds without discussion. However, Aumann’s theorem is not that argument, it’s not the optimal version of Delphi (according to the paper that gwern quoted), and I’m not aware of any such journal. Maybe Plos ONE? I’m not familiar with their process, but their criteria for inclusion are non-standard.
It is a lot more than “sometimes”. In my experience (mainly in computing) no journal editor or conference chair will accept a referee’s report that provides nothing but an overall rating of the paper.
That just tells us that the journals believe that the rating isn’t the only thing that matters. But most journals just do things that make sense to them. They don’t draft their policies based on findings of decision science.
But most journals just do things that make sense to them. They don’t draft their policies based on findings of decision science.
Those findings being? Aumann’s theorem doesn’t go the distance. Anyway, I have no knowledge of how they draft their policies, merely some of what those policies are. Do you have some information to share here?
For example, that Likert scales are nice if you want someone to give you their opinion.
Of course, it might make sense to actually run experiments. Big publishers rule over thousands of journals, so it should be easy for them to do the necessary research if they wanted to.
I think the most straightforward way is to do a second round. Let every referee read the opinions of the other referees and see whether they converge onto a shared judgement. If you want a more formal name, the Delphi method.
What actually happens is that the reasons for the summary judgements are examined.
Three for, one against. Is the dissenter the only one who has not understood the paper, or the only one who knows that although the work is good, almost the same paper has just been accepted to another conference? In both cases the set of summary judgements is the same, but the right final judgement is different. Therefore there is no way to get the latter from the former.
Aumann agreement requires common knowledge of each others’ priors. When does this ever obtain? I believe Robin Hanson’s argument about pre-priors just stands the turtle on top of another turtle.
People don’t coincide in their priors, don’t have access to the same evidence, aren’t running off the same epistemology, and can’t settle epistemological debates non-circularly…
There’s a lot wrong with Aumann, or at least the way some people use it.
What actually happens is that the reasons for the summary judgements are examined.
Really? My understanding was that
Between each iteration of the questionnaire, the facilitator or monitor team (i.e., the person or persons administering the procedure) informs group members of the opinions of their anonymous colleagues. Often this “feedback” is presented as a simple statistical summary of the group response, usually a mean or median value, such as the average group estimate of the date before which an event will occur. As such, the feedback comprises the opinions and judgments of all group members and not just the most vocal. At the end of the polling of participants (after several rounds of questionnaire iteration), the facilitator takes the group judgment as the statistical average (mean or median) of the panelists’ estimates on the final round.
(From Rowe & Wright’s “Expert opinions in forecasting: the role of the Delphi technique”, in the usual Armstrong anthology.) From the sound of it, the feedback is often purely statistical in nature, and if it wasn’t commonly such restricted feedback, it’s hard to see why Rowe & Wright would criticize Delphi studies for this:
The use of feedback in the Delphi procedure is an important feature of the technique. However, research that has compared Delphi groups to control groups in which no feedback is given to panelists (i.e., non-interacting individuals are simply asked to re-estimate their judgments or forecasts on successive rounds prior to the aggregation of their estimates) suggests that feedback is either superfluous or, worse, that it may harm judgmental performance relative to the control groups (Boje and Murnighan 1982; Parenté, et al. 1984). The feedback used in empirical studies, however, has tended to be simplistic, generally comprising means or medians alone with no arguments from panelists whose estimates fall outside the quartile ranges (the latter being recommended by the classical definition of Delphi, e.g., Rowe et al. 1991). Although Boje and Murnighan (1982) supplied some written arguments as feedback, the nature of the panelists and the experimental task probably interacted to create a difficult experimental situation in which no feedback format would have been effective.
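A rough, runnable illustration of the purely statistical feedback the first quoted passage describes, assuming panelists who simply drift part-way toward the group median (that update rule is a made-up placeholder, and the numbers are arbitrary):

```python
# Toy Delphi loop with purely statistical feedback: panelists see only the
# group median, revise, and the final group judgement is the median of the
# last round. The revision rule (drifting toward the feedback) is invented.
from statistics import median

def delphi(initial_estimates, rounds=3, pull=0.3):
    estimates = list(initial_estimates)
    for _ in range(rounds):
        feedback = median(estimates)              # the only information shared
        estimates = [e + pull * (feedback - e) for e in estimates]
    return median(estimates)                      # final group judgement

print(delphi([10, 14, 15, 40]))                   # four panelists' forecasts
```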
I was referring to what actually happens in a programme committee meeting, not the Delphi method.
Fine. Then consider it an example of ‘loony’ behavior in the real world: Delphi pools, as a matter of fact, for many decades, have operated by exchanging probabilities and updating repeatedly, and in a number of cases performed well (justifying their continued usage). You don’t like Delphi pools? That’s cool too, I’ll just switch my example to prediction markets.
It would be interesting to conduct an experiment to compare the two methods for this problem. However, it is not clear how to obtain a ground truth with which to judge the correctness of the results. BTW, my further elaboration, with the example of one referee knowing that the paper under discussion was already published, was also non-fictional. It is not clear to me how any decision method that does not allow for sharing of evidence can yield the right answer for this example.
What have Delphi methods been found to perform well relative to, and for what sorts of problems?
However, it is not clear how to obtain a ground truth with which to judge the correctness of the results.
That assumes we don’t have any criteria on which to judge good versus bad scientific papers.
You could train your model to predict the number of citations that a paper will get.
You can also look at variables such as reproduced papers or withdrawn papers.
Define a utility function that collapses such variables into a single one.
Run a real world experiment in a journal and do 50% of the paper submissions with one mechanism and 50% with the other. Let a few years go by and then you evaluate the techniques based on your utility function.
You could train your model to predict the number of citations that a paper will get. You can also look at variables such as reproduced papers or withdrawn papers.
Define a utility function that collapses such variables into a single one. Run a real world experiment in a journal and do 50% of the paper submissions with one mechanism and 50% with the other. Let a few years go by and then you evaluate the techniques based on your utility function.
Something along those lines might be done, but an interventional experiment (creating journals just to test a hypothesis about refereeing) would be impractical. That leaves observational data-collecting, where one might compare the differing practices of existing journals. But the confounding problems would be substantial.
Or, more promisingly, you could do an experiment with papers that are already published and have a citation record, and have experimental groups of referees assess them, and test different methods of resolving disagreements. That might actually be worth doing, although it has the flaw that it would only be assessing accepted papers and not the full range of submissions.
However, it is not clear how to obtain a ground truth with which to judge the correctness of the results.
It is if you take 5 seconds to think about it and compare it to any prediction market, calibration exercise, forecasting competition, betting company, or general market: finance, geo-political events, sporting events, almanac items. Ground-truths aren’t exactly hard to come by.
What have Delphi methods been found to perform well relative to, and for what sorts of problems?
I already mentioned a review paper. It’s strange you aren’t already familiar with the strengths and weaknesses of decision & forecasting methods which involve people communicating only summaries of their beliefs to reach highly accurate results, given how loony you think these methods are and how certain of this you are.
It is if you take 5 seconds to think about it. Finance. Geo-political events. Sporting events. Almanac items.
Sorry, I was still talking (“this problem”) about the example I introduced.
I already mentioned a review paper.
Which recommends sharing “average estimates plus justifications” and “provide the mean or median estimate of the panel plus the rationales from all panellists”. They found that providing reasons was better than only statistics of the judgements (see paragraph following your second quote). As happens in the programme committee. The main difference from Delphi is that the former is not structured into rounds in the same way. The referees send in their judgements, then the committee (a smaller subset of the referees) decides.
Sorry, I was still talking (“this problem”) about the example I introduced.
If the method works on other problems, that seems like good evidence it works on your specific conference paper problem, no?
Which recommends sharing “average estimates plus justifications” and “provide the mean or median estimate of the panel plus the rationales from all panellists”.
Indeed, it does—but it says that it works better than purely statistical feedback. More information is often better. But why is that relevant? You are moving the goalposts; earlier you asked:
It is not clear to me how any decision method that does not allow for sharing of evidence can yield the right answer for this example.
I brought up prediction markets and Delphi pools because they are mechanisms which function very similarly to Aumann agreement in sharing summaries rather than evidence, and yet they work. Whether they work is not the same question as whether there is anything which could work faster, and you are replying to the former question, which is indisputably true despite your skepticism, as if it were the latter. (It’s obvious that simply swapping summaries may be slower than regular Aumannian agreement: you could imagine that instead of taking a bunch of rounds to converge, one sends all its data to the other, the other recomputes, and sends the new result back and convergence is achieved.)
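To make the convergence-by-summaries dynamic concrete, here is a toy simulation in the style of Geanakoplos and Polemarchakis’ “We Can’t Disagree Forever”: two agents with a common prior and private partitions repeatedly announce only their posteriors, each announcement lets the other rule out states, and the announced posteriors eventually coincide. The state space, event and partitions below are invented examples:

```python
# Toy simulation of agreement reached purely by exchanging posterior
# summaries (in the spirit of Aumann and Geanakoplos & Polemarchakis).
# A common prior and honest reporting are assumed; all numbers are invented.
from fractions import Fraction

STATES = range(12)
PRIOR = {s: Fraction(1, 12) for s in STATES}
EVENT = {0, 1, 2, 3, 4}                          # the proposition being estimated

# Private information: each agent learns only which cell of their own
# partition contains the true state.
PARTITIONS = [
    [{0, 1, 2, 3}, {4, 5, 6, 7}, {8, 9, 10, 11}],    # agent 0 sees "rows"
    [{0, 4, 8}, {1, 5, 9}, {2, 6, 10}, {3, 7, 11}],  # agent 1 sees "columns"
]

def prob(info):
    """Posterior probability of EVENT given an information set."""
    return sum(PRIOR[s] for s in info & EVENT) / sum(PRIOR[s] for s in info)

def cell(i, w):
    return next(c for c in PARTITIONS[i] if w in c)

def agree(true_state, max_rounds=20):
    # poss[i][w]: the states agent i would still consider possible if the
    # true state were w, given the public history of announcements so far.
    poss = [{w: set(cell(i, w)) for w in STATES} for i in range(2)]
    for r in range(max_rounds):
        speaker, listener = r % 2, 1 - r % 2
        announce = {w: prob(poss[speaker][w]) for w in STATES}
        print(f"round {r}: agent {speaker} announces {announce[true_state]}")
        # The listener keeps only states consistent with the announcement heard.
        for w in STATES:
            poss[listener][w] &= {v for v in STATES if announce[v] == announce[w]}
        if prob(poss[0][true_state]) == prob(poss[1][true_state]):
            return prob(poss[0][true_state])
    return None

print("agreed posterior:", agree(true_state=5))
```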
Verbal abuse is not a productive response to the results of an abstract model. Extended imaginary scenarios are not a productive response either. Neither explains why the proofs are wrong or inapplicable, or if inapplicable, why they do not serve useful intellectual purposes such as proving some other claim by contradiction or serving as an ideal to aspire to. Please try to do better.
As I said, the scenario is not imaginary.
I might have done so, had you not inserted that condescending parting shot.
Yes, it is. You still have not addressed what is either wrong with the proofs or why their results are not useful for any purpose.
Wow. So you started it, and now you’re going to use a much milder insult as an excuse not to participate? Please try to do better.
Well, the caravan moves on. That −1 on your comment isn’t mine, btw.
That was excessive, and I now regret having said it.
Something along those lines might be done, but an interventional experiment (creating journals just to test a hypothesis about refereeing) would be impractical.
Then there is no reason why you can’t test different procedures in an existing journal.
The referees send in their judgements, then the committee (a smaller subset of the referees) decides.
None of this is Aumann sharing of posteriors.