Another author here! Regarding the 74% vs. 84% numbers specifically: a key takeaway our error analysis is intended to communicate is that a large fraction of the errors judges made in debates seemed pretty easily fixable with more careful judges, whereas this didn't feel like the case with consultancy.
For example, Julian and I both had 100% accuracy as judges on the 36 human debates we judged, which was ~20% of all correct human debate judgments. So I'd guess that more careful judges overall could increase debate accuracy to at least 90%, maybe higher, although at that point we start hitting measurement limits from the questions themselves being noisy.