I was referring to what actually happens in a programme committee meeting, not the Delphi method.
Fine. Then consider it an example of ‘loony’ behavior in the real world: Delphi pools have in fact operated for many decades by exchanging probabilities and updating repeatedly, and in a number of cases they have performed well (which justifies their continued use). You don’t like Delphi pools? That’s cool too; I’ll just switch my example to prediction markets.
It would be interesting to conduct an experiment to compare the two methods for this problem. However, it is not clear how to obtain a ground truth with which to judge the correctness of the results. BTW, my further elaboration, with the example of one referee knowing that the paper under discussion was already published, was also non-fictional. It is not clear to me how any decision method that does not allow for sharing of evidence can yield the right answer for this example.
What have Delphi methods been found to perform well relative to, and for what sorts of problems?
However, it is not clear how to obtain a ground truth with which to judge the correctness of the results.
That assumes we don’t have any criteria on which to judge good versus bad scientific papers.
You could train a model to predict the number of citations a paper will get. You can also look at variables such as whether papers are later reproduced or withdrawn.
Define a utility function that collapses these variables into a single score. Run a real-world experiment in a journal: handle 50% of the submissions with one mechanism and 50% with the other. Let a few years go by, then evaluate the two techniques using your utility function.
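As a concrete illustration of that proposal, here is a minimal sketch in Python, with the caveat that every name, weight, and threshold in it is a made-up placeholder rather than anything the discussion specifies:

```python
import random

# Hypothetical utility that collapses several outcome variables into one score.
# The weights are arbitrary placeholders, not empirically justified values.
def utility(citations: int, reproduced: bool, withdrawn: bool) -> float:
    score = float(citations)
    if reproduced:
        score += 50.0    # bonus for an independent reproduction
    if withdrawn:
        score -= 100.0   # penalty for withdrawal/retraction
    return score

# Randomly assign each submission to one of the two review mechanisms (50/50).
def assign_mechanism(submission_id: str) -> str:
    return random.choice(["committee_discussion", "delphi_style"])

# Years later: compare the mean utility of the papers each mechanism accepted.
def evaluate(accepted_by_mechanism: dict[str, list[tuple[int, bool, bool]]]) -> dict[str, float]:
    return {
        mechanism: sum(utility(*paper) for paper in papers) / len(papers)
        for mechanism, papers in accepted_by_mechanism.items()
    }
```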
Something along those lines might be done, but an interventional experiment (creating journals just to test a hypothesis about refereeing) would be impractical. That leaves observational data collection, comparing the differing practices of existing journals, but the confounding problems would be substantial.
Or, more promisingly, you could take papers that are already published and have a citation record, have experimental groups of referees assess them, and test different methods of resolving disagreements. That might actually be worth doing, although it has the flaw that it would only assess accepted papers and not the full range of submissions.
Then there is no reason why you can’t test different procedures in an existing journal.
However, it is not clear how to obtain a ground truth with which to judge the correctness of the results.
It is if you take 5 seconds to think about it and compare it to any prediction market, calibration exercise, forecasting competition, betting company, or general market: finance, geo-political events, sporting events, almanac items. Ground truths aren’t exactly hard to come by.
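To make the role of ground truth concrete: in all of those settings, probabilistic forecasts are scored against outcomes once the events resolve. A minimal sketch, with made-up numbers, using the Brier score:

```python
# Brier score: mean squared error between forecast probabilities and observed
# outcomes (1 if the event happened, 0 if it did not). Lower is better;
# always guessing 0.5 scores 0.25.
def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Made-up example: three resolved questions (a match result, an election, a price move).
forecasts = [0.8, 0.3, 0.6]   # stated probabilities
outcomes = [1, 0, 0]          # what actually happened
print(brier_score(forecasts, outcomes))  # about 0.163
```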
What have Delphi methods been found to perform well relative to, and for what sorts of problems?
I already mentioned a review paper. It’s strange that you aren’t already familiar with the strengths and weaknesses of decision & forecasting methods in which people communicate only summaries of their beliefs and yet reach highly accurate results, given how loony you think these methods are and how certain of that you are.
It is if you take 5 seconds to think about it. Finance. Geo-political events. Sporting events. Almanac items.
Sorry, I was still talking (“this problem”) about the example I introduced.
I already mentioned a review paper.
Which recommends sharing “average estimates plus justifications” and “provide the mean or median estimate of the panel plus the rationales from all panellists”. They found that providing reasons was better than providing only statistics of the judgements (see the paragraph following your second quote). That is what happens in the programme committee. The main difference from Delphi is that the committee process is not structured into rounds in the same way: the referees send in their judgements, then the committee (a smaller subset of the referees) decides. None of this is Aumann sharing of posteriors.
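For readers unfamiliar with the procedure being contrasted here, a rough sketch of a single Delphi-style round (panel median plus everyone’s rationales fed back, then re-estimation); the revision rule is a placeholder, since real panellists would weigh the rationales rather than just the number:

```python
from statistics import median
from typing import Callable

# One Delphi-style round: feed back the panel median plus every rationale,
# then let each panellist revise their own estimate.
def delphi_round(
    estimates: list[float],
    rationales: list[str],
    reestimate: Callable[[float, float, list[str]], float],
) -> list[float]:
    panel_median = median(estimates)
    return [reestimate(own, panel_median, rationales) for own in estimates]

# Placeholder revision rule: move 30% of the way toward the panel median.
def naive_reestimate(own: float, panel_median: float, rationales: list[str]) -> float:
    return own + 0.3 * (panel_median - own)

# Toy run: three referees' probabilities that a paper should be accepted.
estimates = [0.9, 0.4, 0.6]
rationales = ["already published elsewhere", "strong results", "weak evaluation"]
print(delphi_round(estimates, rationales, naive_reestimate))  # iterate for further rounds
```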
Sorry, I was still talking (“this problem”) about the example I introduced.
If the method works on other problems, that seems like good evidence it works on your specific conference paper problem, no?
Which recommends sharing “average estimates plus justifications” and “provide the mean or median estimate of the panel plus the rationales from all panellists”.
Indeed it does, but it says that estimates plus justifications work better than purely statistical feedback. More information is often better. But why is that relevant? You are moving the goalposts; earlier you asked:
It is not clear to me how any decision method that does not allow for sharing of evidence can yield the right answer for this example.
I brought up prediction markets and Delphi pools because they are mechanisms which function very similarly to Aumann agreement in sharing summaries rather than evidence, and yet they work. Whether they work is not the same question as whether there is anything which could work faster, and you are replying to the former question, to which the answer is indisputably yes despite your skepticism, as if it were the latter. (It’s obvious that regular Aumannian agreement, simply swapping summaries, may be slower than sharing the evidence itself: instead of taking a bunch of rounds to converge, one party sends all its data to the other, the other recomputes and sends the new result back, and convergence is achieved in a single exchange.)
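A toy beta-binomial illustration of that parenthetical, with the strong caveat that the ‘summary exchange’ loop below is a crude stand-in (partial averaging of point estimates), not the actual Aumann update over common knowledge; the numbers and the damping factor are made up:

```python
# Two observers estimate a coin's bias from private flip data.
def posterior_mean(heads: int, tails: int, a: float = 1.0, b: float = 1.0) -> float:
    """Mean of the Beta(a + heads, b + tails) posterior under a Beta(a, b) prior."""
    return (a + heads) / (a + b + heads + tails)

# Private data: observer A saw 8 heads / 2 tails, observer B saw 3 heads / 7 tails.
a_heads, a_tails = 8, 2
b_heads, b_tails = 3, 7

# Sharing the evidence itself: one exchange, and both hold the same pooled posterior.
pooled = posterior_mean(a_heads + b_heads, a_tails + b_tails)

# Swapping summaries only (crude stand-in): each nudges 30% toward the other's
# stated estimate per round, so agreement takes several rounds, and the value
# they settle on need not equal the pooled-data answer.
est_a = posterior_mean(a_heads, a_tails)
est_b = posterior_mean(b_heads, b_tails)
for _ in range(10):
    est_a, est_b = est_a + 0.3 * (est_b - est_a), est_b + 0.3 * (est_a - est_b)

print(pooled, est_a, est_b)  # about 0.545 vs. two estimates near 0.542
```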