Crossposting my comments from a Slack thread:
Here are some debate trees from experiments I did on long-text QA, using this example short story:
Tree
Debater view 1
Debater view 2
Our conclusion was that we don’t expect debate to work robustly in these cases. For us this was mostly because, when the debate is over something like ‘is there implied subtext A?’, human debaters don’t really know why they believe some text does or doesn’t have a particular implication. They have some mix of priors about what the text might be saying (which can’t really be justified with debate) and various updates to that based on style, word choice, etc., and humans don’t necessarily have introspective access to what exactly in the text made them come to the conclusion.

My guess is that’s not the limitation you’re running into here; I’d expect that to just be the depth.
There are other issues with text debates, such as when the evidence is distributed across many quotes that each provide only a small amount of evidence. In that case the honest debater needs decent estimates of how much evidence each quote provides, so they can split their argument into claims like ‘there are 10 quotes that each weakly support position A’ and ‘the evidence these quotes provide is additive rather than redundant’.
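To make the ‘additive rather than redundant’ distinction concrete, here is a minimal sketch (mine, not from the original thread) under a naive-Bayes assumption: each independent quote contributes its log-likelihood-ratio to the total log-odds, so ten individually weak quotes can add up to strong evidence, while ten redundant quotes restating the same observation are only worth one. The per-quote likelihood ratio of 1.5 is an illustrative assumption, not a measured value.

```python
import math

def log_odds(p):
    """Convert a probability to log-odds."""
    return math.log(p / (1 - p))

def posterior(prior, likelihood_ratios):
    """Combine a prior with independent pieces of evidence by adding
    each piece's log-likelihood-ratio to the prior's log-odds."""
    total = log_odds(prior) + sum(math.log(lr) for lr in likelihood_ratios)
    return 1 / (1 + math.exp(-total))

# Ten independent quotes, each only weakly favoring position A
# (likelihood ratio 1.5 per quote; an illustrative assumption).
print(posterior(0.5, [1.5] * 10))  # ~0.98: weak evidence accumulates

# If the ten quotes are redundant (all restating one observation),
# only one update is warranted, and the shift is small.
print(posterior(0.5, [1.5]))       # 0.6
```

On this picture, the honest debater’s hard problem is defending the independence claim itself, since the aggregate argument is only as strong as that assumption.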
[edited to fix links]
Yep. (Thanks for re-posting.) We’re pretty resigned to the conclusion that debate fails to reach a correct conclusion in at least some non-trivial cases; we’re mainly interested in figuring out (i) whether there are significant domains or families of questions for which it will often reach a correct conclusion, and (ii) whether it tends to fail gracefully (i.e., every outcome is either correct or a draw).