Sorry for not understanding how much context was missing here.
The right starting point for your question is this writeup, which describes the state of debate experiments at OpenAI as of end-of-2019, including the rules we were using at that time. Those rules are a work in progress, but I think they are good enough for the purpose of this discussion.
In those rules: If we are running a depth-T+1 debate about X and we encounter a disagreement about Y, then we start a depth-T debate about Y and judge exclusively based on that. We totally ignore the disagreement about X.
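To make the recursion concrete, here is a minimal sketch of the rule as stated in the previous paragraph. Everything in it is an illustrative assumption rather than the actual experimental setup: the `Claim` structure, the `pick_disagreement` and `judge` callables, and the choice to judge a claim directly when no disagreement surfaces are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Claim:
    """A claim plus the sub-claims offered in its support (hypothetical structure)."""
    text: str
    subclaims: List["Claim"] = field(default_factory=list)


def run_debate(
    claim: Claim,
    depth: int,
    pick_disagreement: Callable[[Claim], Optional[Claim]],
    judge: Callable[[Claim], str],
) -> str:
    """Return the verdict for a debate about `claim` with `depth` recursions left."""
    if depth == 0:
        # No budget left: the judge decides this claim directly.
        return judge(claim)
    disputed = pick_disagreement(claim)
    if disputed is None:
        # Assumption: with no disagreement, the judge decides the claim as it stands.
        return judge(claim)
    # A depth-(T+1) debate about X becomes a depth-T debate about Y.
    # The verdict on Y is returned unchanged; X itself is never judged.
    return run_debate(disputed, depth - 1, pick_disagreement, judge)
```

With a chain X, Y, Z in which each claim rests on the next, a depth-1 debate about X is decided entirely by the judgment on Y, and a deeper debate is decided by the judgment on Z.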
Our current rules—to hopefully be published sometime this quarter—handle recursion in a slightly more nuanced way. In the current rules, after debating Y we should return to the original debate. We allow the debaters to make a new set of arguments, and it may be that one debater now realizes they should concede, but it’s important that a debater who had previously made an untenable claim about X will eventually pay a penalty for doing so (in addition to whatever payoff they receive in the debate about Y). I don’t expect this paragraph to be clear and don’t think it’s worth getting into until we publish an update, but wanted to flag it.
Do the debaters know how long the debate is going to be?
Yes.
To what extent are you trying to claim some relationship between the judge strategy you’re describing and the honest one? EG, that it’s eventually close to honest judging? (I’m asking whether this seems like an important question for the discussion vs one which should be set aside.)
If debate works, then at equilibrium the judge will always be favoring the better answer. If furthermore the judge believes that debate works, then this will also be their honest belief. So if judges believe in debate then it looks to me like the judging strategy must eventually approximate honest judging. But this is downstream of debate working; it doesn’t play an important role in the argument that debate works or anything like that.
Yep, that document was what I needed to see. I wouldn’t say all my confusions are resolved, but I need to think more carefully about what’s in there. Thanks!
Symmetry Concerns
It seems the symmetry concerns of that document are quite different from the concerns I was voicing. The symmetry concerns in the document are, iiuc,
The debate goes well if the honest player expounds an argument, and the dishonest player critiques that argument. However, the debate goes poorly if those roles end up reversed. Therefore we force both players to do both.
OTOH, my symmetry concerns can be summarized as follows:
If player 2 chooses an answer after player 1 (getting access to player 1’s answer in order to select a different one), then assuming competent play, player 1’s answer will almost always be the better one. This prior taints the judge’s decision in a way which seems to seriously reduce the training signal and threaten the desired equilibrium.
If the two players choose simultaneously, then it’s hard to see how to discourage them from selecting the same answer. This seems likely at late stages due to convergence, and also likely at early stages due to the fact that both players actually use the same NN. This again seriously reduces the training signal.
I now believe that this concern can be addressed, although it seems a bit fiddly, and the mechanism which I currently believe addresses the problem is somewhat complex.
Known Debate Length
I’m a bit confused why you would make the debate length known to the debaters. This seems to allow them to make indefensible statements at the very end of a debate, secure in the knowledge that they can’t be critiqued. One step before the end, they can make statements which can’t be convincingly critiqued in one step. And so on.
Instead, it seems like you’d want the debate to end randomly, according to a memoryless distribution. This way, the expected future debate length is the same at all times, meaning that any statement made at any point is facing the same expected demand of defensibility.
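As a rough illustration of the memoryless point, the sketch below compares the expected number of remaining rounds under a geometric stopping rule with a fixed, known length. The function names, the per-round stopping probability, and the truncation `horizon` are assumptions made for this example only; nothing here comes from the actual debate rules.

```python
from typing import Callable


def expected_remaining(pmf: Callable[[int], float], elapsed: int, horizon: int) -> float:
    """E[N - elapsed | N > elapsed], where N is the total number of rounds."""
    # Probability mass on every total length that is still possible.
    tail = [(n, pmf(n)) for n in range(elapsed + 1, horizon + 1)]
    mass = sum(p for _, p in tail)
    return sum((n - elapsed) * p for n, p in tail) / mass


def geometric(stop_prob: float) -> Callable[[int], float]:
    """pmf of a debate that ends after each round with probability stop_prob."""
    return lambda n: stop_prob * (1 - stop_prob) ** (n - 1)


def fixed(length: int) -> Callable[[int], float]:
    """pmf of a debate whose length is known in advance."""
    return lambda n: 1.0 if n == length else 0.0
```

With `geometric(0.25)` and a large `horizon`, `expected_remaining` comes out at about 4 rounds whether 0 or 7 rounds have already elapsed. With `fixed(6)` it shrinks from 4 rounds after round 2 to 1 round after round 5, which is the shrinking window that lets late statements escape critique.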
Factored Cognition
I currently think all my concerns can be addressed if we abandon the link to factored cognition and defend a less ambitious thesis about debate. The Feb 2020 proposal does touch on some of my concerns there, by enforcing a good argumentative structure, rather than allowing the debate to spiral out of control (due to e.g. delaying tactics).
However, my overall position is still one of skepticism wrt the link to factored cognition. The most salient reason for me ATM is the concern that debaters needn’t structure their arguments as DAGs which ground out in human-verifiable premises, but rather, can make large circular arguments (too large for the debate structure to catch) or unbounded argument chains (or simply very very high depth argument trees, which contain a flaw at a point far too deep for debate to find).
ETA: Having now read more of the Feb 2020 report, I see that very similar concerns are expressed near the end; the long computation problem seems pretty similar to what I’m pointing at.
I’m a bit confused why you would make the debate length known to the debaters. This seems to allow them to make indefensible statements at the very end of a debate, secure in the knowledge that they can’t be critiqued. One step before the end, they can make statements which can’t be convincingly critiqued in one step. And so on.
[...]
The most salient reason for me ATM is the concern that debaters needn’t structure their arguments as DAGs which ground out in human-verifiable premises, but rather, can make large circular arguments (too large for the debate structure to catch) or unbounded argument chains (or simply very very high depth argument trees, which contain a flaw at a point far too deep for debate to find).
If I assert “X because Y & Z” and the depth limit is 0, you aren’t intended to say “Yup, checks out,” unless Y and Z and the implication are self-evident to you. Low-depth debates are supposed to ground out with the judge’s priors / low confidence in things that aren’t easy to establish directly (because if I’m only updating on “Y looks plausible in a very low-depth debate” then I’m going to say “I don’t know but I suspect X” is a better answer than “definitely X”). That seems like a consequence of the norms in my original answer.
In this context, a circular argument just isn’t very appealing. At the bottom you are going to be very uncertain, and all that uncertainty is going to propagate all the way up.
Instead, it seems like you’d want the debate to end randomly, according to a memoryless distribution. This way, the expected future debate length is the same at all times, meaning that any statement made at any point is facing the same expected demand of defensibility.
If you do it this way the debate really doesn’t seem to work, as you point out.
For my part I mostly care about the ambitious thesis.
If the two players choose simultaneously, then it’s hard to see how to discourage them from selecting the same answer. This seems likely at late stages due to convergence, and also likely at early stages due to the fact that both players actually use the same NN. This again seriously reduces the training signal.
If player 2 chooses an answer after player 1 (getting access to player 1’s answer in order to select a different one), then assuming competent play, player 1’s answer will almost always be the better one. This prior taints the judge’s decision in a way which seems to seriously reduce the training signal and threaten the desired equilibrium.
I disagree with both of these as objections to the basic strategy, but don’t think they are very important.