The standard argument against making the debate game non-zero-sum is that you may then incentivise your debaters to collude.
I don’t know if you’ve seen our most recent debate rules and attempt at analysis of whether they provide the desired behavior—seems somewhat relevant to what you’re thinking about here.
I took a look, and it was indeed helpful. However, I left a comment there about a concern I have. The argument at the end only establishes what you call D-acceptability: that no answer is judged better after D steps of debate. My concern is that even if debaters are D-acceptable for every D, that does not mean they are honest: they can instead use non-well-founded argument trees which never bottom out.
That is a concern, but only in the case where no answer has an argument tree that bottoms out in depth < D. As long as some answer is supported by a depth < D tree, that answer will beat the answers supported only by argument trees that never bottom out within D steps.
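To make the well-foundedness condition concrete, here is a toy sketch of what “bottoms out in depth < D” means and why such an answer wins (the ArgumentNode structure and judge model here are purely illustrative, not the actual debate setup):

```python
# Toy sketch of the well-foundedness condition; ArgumentNode and the
# "judge-checkable leaf" notion are hypothetical stand-ins.

from dataclasses import dataclass, field
from typing import List

@dataclass
class ArgumentNode:
    claim: str
    judge_checkable: bool = False           # a leaf the human judge can verify directly
    children: List["ArgumentNode"] = field(default_factory=list)

def bottoms_out(node: ArgumentNode, depth_budget: int) -> bool:
    """True iff every branch under `node` reaches a judge-checkable leaf
    within `depth_budget` steps, i.e. the tree is well-founded at depth < D."""
    if node.judge_checkable:
        return True
    if depth_budget == 0:
        return False                         # ran out of depth before grounding out
    return all(bottoms_out(child, depth_budget - 1) for child in node.children)

def preferred_answer(tree_a: ArgumentNode, tree_b: ArgumentNode, D: int) -> str:
    """An answer supported by a depth<D tree beats an answer supported only
    by trees that never ground out within D steps."""
    a_ok, b_ok = bottoms_out(tree_a, D), bottoms_out(tree_b, D)
    if a_ok and not b_ok:
        return tree_a.claim
    if b_ok and not a_ok:
        return tree_b.claim
    return "tie (both or neither ground out within D)"
```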
So there is a case where the debaters are not incentivised to be honest: the case where the debaters know something but there’s no human-understandable argument for it that bottoms out in fewer than D steps. This is where the PSPACE constraint comes from.
If we include cross-examination (which the analysis there did not cover), then we can get rid of this constraint: each debater commits to an argument tree, then each debater points out the weakest node in the opponent’s tree (or points out that some part of it doesn’t bottom out).
(We can only handle really large trees if we assume the debaters are computationally unbounded, though. If we don’t assume this, then even if they still have oracles for some specific problems, we probably can’t supervise anything that’s not in NP, because of the obfuscated argument problem.)
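A rough sketch of that commit-then-challenge structure, just to pin down the shape of the protocol (weakest_node and judge are placeholder policies, not the actual cross-examination rules):

```python
# Rough sketch of the commit-then-challenge structure described above.
# `weakest_node` and `judge` are hypothetical callables, not an
# implementation of the real cross-examination mechanism.

from typing import Callable

def run_round(tree_a, tree_b,
              weakest_node: Callable,       # debater's policy for picking a node to attack
              judge: Callable) -> str:
    """Each debater commits to a full argument tree up front, then each one
    challenges a single node in the opponent's tree (or flags a branch that
    never bottoms out). The judge only evaluates the challenged nodes,
    never the whole trees."""
    challenge_on_b = weakest_node(tree_b)    # A attacks the weakest point of B's tree
    challenge_on_a = weakest_node(tree_a)    # B attacks the weakest point of A's tree
    a_survives = judge(challenge_on_a)       # does A's challenged node hold up?
    b_survives = judge(challenge_on_b)
    if a_survives and not b_survives:
        return "A wins"
    if b_survives and not a_survives:
        return "B wins"
    return "undecided; recurse on the challenged nodes or call it a tie"
```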
I think the collusion concern basically over-anthropomorphizes the training process. In the prisoner’s dilemma, say, if you train myopically, then “all incentives point toward defection” translates concretely into actual defection: each update only chases the agent’s own immediate reward, so the learned policies defect even though mutual cooperation would score higher for both.
Granted, there are training regimes in which this doesn’t happen, but those would have to be avoided.
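As a toy illustration of that point (purely illustrative, not anything from an actual debate training setup): two agents trained myopically on one-shot prisoner’s dilemma payoffs both drift toward defection, even though mutual cooperation has higher joint reward.

```python
# Two agents trained myopically on one-shot prisoner's dilemma payoffs.
# Each REINFORCE-style update only chases the agent's own immediate reward,
# so both end up defecting, i.e. no learned collusion.

import math, random

PAYOFF = {  # (my action, their action) -> my reward; 0 = cooperate, 1 = defect
    (0, 0): 3, (0, 1): 0,
    (1, 0): 5, (1, 1): 1,
}

def sample(logit: float) -> int:
    """Sample defect (1) with probability sigmoid(logit)."""
    p_defect = 1.0 / (1.0 + math.exp(-logit))
    return 1 if random.random() < p_defect else 0

logit_a = logit_b = 0.0                      # both start at 50/50
lr = 0.05

for _ in range(5000):
    a, b = sample(logit_a), sample(logit_b)
    r_a, r_b = PAYOFF[(a, b)], PAYOFF[(b, a)]
    p_a = 1.0 / (1.0 + math.exp(-logit_a))
    p_b = 1.0 / (1.0 + math.exp(-logit_b))
    # Myopic policy-gradient step: each agent scales the log-probability
    # gradient of its own action by its own immediate reward only.
    logit_a += lr * r_a * (a - p_a)
    logit_b += lr * r_b * (b - p_b)

print("P(defect) for A:", 1.0 / (1.0 + math.exp(-logit_a)))
print("P(defect) for B:", 1.0 / (1.0 + math.exp(-logit_b)))
# Both probabilities climb toward 1: with myopic updates, the incentive
# structure translates directly into defection rather than collusion.
```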
OTOH, the concern might be that an inner optimizer would develop which colludes. This would have to be dealt with by more general anti-inner-optimizer technology.
Yep, I should take a look!