I think the judge should state eir honest opinion. To solve the problem of sparse feedback in the early phase, give the system access to more data than just win/lose from its own games. You can initialize it by training on human debates. Or, you can give it other input channels that will allow it to gradually build a sophisticated model of the world that includes the judge’s answer as a special case. For example, if you monitor humans for a long time you can start predicting human behavior, and the judge’s ruling is an instance of that.
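To make the "more data than win/lose" idea concrete, here is a minimal sketch, assuming a simple two-phase setup; all names (HumanDebate, DebaterPolicy, JudgeModel) are hypothetical stand-ins, not an existing API. Phase 1 imitates human debate transcripts and learns to predict the judge's rulings as ordinary supervised learning, so the early training signal is dense; only afterwards would the sparse win/lose self-play signal take over.

```python
# Hypothetical sketch of the initialization idea above; all names are
# illustrative stand-ins, not part of any existing codebase.
from dataclasses import dataclass
from typing import List


@dataclass
class HumanDebate:
    question: str
    statements: List[str]  # alternating statements by the two human debaters
    ruling: int            # 0 or 1: which side the human judge favoured


class DebaterPolicy:
    """Stand-in for a learned model mapping debate context -> next statement."""
    def imitate(self, context: str, target_statement: str) -> None:
        ...  # supervised update towards the human debater's move


class JudgeModel:
    """Stand-in for a model of human behaviour; predicting the judge's ruling
    is treated as just one more instance of predicting what humans do."""
    def observe(self, transcript: List[str], ruling: int) -> None:
        ...  # supervised update towards the observed human ruling


def initialize_from_human_data(policy: DebaterPolicy,
                               judge_model: JudgeModel,
                               corpus: List[HumanDebate]) -> None:
    """Dense early-phase signal: imitate human debates and predict rulings,
    instead of relying only on sparse win/lose feedback from self-play."""
    for debate in corpus:
        context = debate.question
        for statement in debate.statements:
            policy.imitate(context, statement)
            context = context + "\n" + statement
        judge_model.observe(debate.statements, debate.ruling)
```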
I still have other problems with the honest strategy.
I’ve usually seen the truthful equilibrium (ie, the desired result of training) described as one where the first player always gives the real answer, and the second player has to lie. If the honest judge knows this, then this may interfere with how they give feedback. IE, they may let the first player get away with a lot more due to their prior that the first player gave the right answer (e.g. my parody debate in the OP). This suggests that, under the honest judgement policy, perfect honesty (or 1-epsilon honesty for negligible epsilon) is not a stable equilibrium in some sense, since there is no incentive preserving it. Past some point, the training signal gets worse as the strategies get “better” (better in the truth-telling direction).
If the signal is poor when debater strategies are very untruthful, and the signal is poor when debater strategies are very truthful, then the argument must be that the training signal is good for middling truthfulness. But that’s not clear to me, particularly for issues which require longer debates.
Does the honest strategy encourage truthfulness?
The way people reason about this seems to rest on two assumptions.
First, if a debater says something wrong, the other debater can challenge them to defend claims and sub-claims, eventually cornering them in an obvious falsehood (ie, one which the human can verify is false).
This depends on the cooperation of the dishonest player, giving justifications with a DAG structure which eventually ground out in verifiable/falsifiable claims. The dishonest player might instead give circular justifications with loop length greater than the debate length, or chains of justification that are unbounded, or use delaying tactics to try and push defeat off the end of the argument transcript, or refuse to give justifications at all. These strategies deprive the judge of information needed to make an informed decision. (The toy sketch at the end of this comment illustrates how such justifications fail to ground out within the debate length.)
Second, when the honest judge sees this, they decide in favor of the other player.
It’s natural to think whoever is caught in a lie loses. But being caught in a lie does not automatically mean your position was incorrect. The honest judge must take all information into account to try and determine who was correct. It seems to me that getting caught in a lie will not always be decisive, especially at an intermediate point in training where both AIs will sometimes be lying.
Does the honest strategy encourage justifications which ground out in verifiable/falsifiable statements?
If this were the case, it would support the claim that an honest judge encourages truthful debate strategies, since it’s a bullet point underneath that question. However, I already made some remarks there about why it might not be true.
In addition to those remarks, I note that the naive argument in favor would seem to be that such justifications help the honest judge by giving decisive evidence in support of a claim. A debater wants to do that if possible. However, the problem is that debate is supposed to allow justification trees which are larger than can possibly be explained to the human, but which make sense to a human at every step. The argument that debaters use such trees has to be more complex.
What the perfectly honest strategy actually does in any case is very complicated since, as Paul said in his answer, we don’t know exactly what you should infer upon seeing a debate.
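To make the grounding-out failure concrete, here is a toy sketch; the graph representation and the names are my own illustration, not part of the debate proposal. A claim only "grounds out" if every justification path reaches a human-checkable claim within the number of steps the transcript can actually contain, which circular or unbounded justifications never do.

```python
# Toy illustration: why circular or unbounded justifications starve the judge
# of the information needed to decide. All structure here is illustrative.
from typing import Dict, List, Set

# claim -> list of sub-claims offered as justification ([] = no justification given)
JustificationGraph = Dict[str, List[str]]


def grounds_out(graph: JustificationGraph, claim: str, checkable: Set[str],
                steps_left: int) -> bool:
    if claim in checkable:
        return True                  # the human can verify/falsify this directly
    if steps_left == 0:
        return False                 # defeat pushed past the end of the transcript
    subclaims = graph.get(claim, [])
    if not subclaims:
        return False                 # justification refused and claim not checkable
    return all(grounds_out(graph, sub, checkable, steps_left - 1)
               for sub in subclaims)


# An honest justification DAG grounds out quickly...
honest = {"A": ["B", "C"], "B": [], "C": []}
print(grounds_out(honest, "A", checkable={"B", "C"}, steps_left=3))   # True

# ...while a circular justification whose loop exceeds the debate length never does.
circular = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["A"]}
print(grounds_out(circular, "A", checkable=set(), steps_left=3))      # False
```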
I’ve usually seen the truthful equilibrium (ie, the desired result of training) described as one where the first player always gives the real answer, and the second player has to lie.
That seems weird, why would we do that? I always thought of it as: there is a yes/no question, agent 1 is arguing for “yes”, agent 2 is arguing for “no”.
However, the problem is that debate is supposed to allow justification trees which are larger than can possibly be explained to the human, but which make sense to a human at every step.
I didn’t realize you make this assumption. I agree that it makes things much more iffy (I’m somewhat skeptical about “factored cognition”). But, debate can be useful without this assumption also. We can imagine an AI answering questions for which the answer can be fully explained to a human, but it’s still superintelligent because it comes up with those answers much faster than a human or even all of humanity put together. In this case, I would still worry that, scaled up indefinitely, it can lead to AIs hacking humans in weird ways. But, plausibly there is a middle region (that we can access by quantilization?) where they are strong enough to be superhuman and to lie in “conventional” ways (which would be countered by the debate opponent), but too weak for weird hacking. And, in any case, combining this idea with other alignment mechanisms can lead to something useful (e.g. I suggested using it in Dialogic RL).
That seems weird, why would we do that? I always thought of it as: there is a yes/no question, agent 1 is arguing for “yes”, agent 2 is arguing for “no”.
Ah, well, that does make more sense for the case of binary (or even n-ary) questions. The version in the original paper was free-response.
If answers are pre-assigned like that, then my issues with the honest judging strategy are greatly reduced. However, it’s no longer meaningful to speak of a truth-telling equilibrium, and instead the question seems to be whether false claims are typically (convincingly) uncovered to be false given enough debate time.
I didn’t realize you make this assumption. I agree that it makes things much more iffy (I’m somewhat skeptical about “factored cognition”).
Yeah, I’ve heard (through the grapevine) that Paul and Geoffrey Irving think debate and factored cognition are tightly connected. It didn’t occur to me to try and disentangle them. I do feel a lot better about your version.
It harnesses the power of search to find arguments which convince humans but which humans couldn’t have found.
It harnesses the adversarial game to find counterarguments, as a safeguard against manipulative/misleading arguments.
It harnesses the same safeguard recursively, to prevent manipulative counterargument, counter-counterargument, etc. Under some assumptions about the effectiveness of the safeguard, this would ensure non-manipulation.
None of this requires anything about factored cognition, or arguments bigger than a human can understand. If one believed in factored cognition, some version of HCH could be used to judge the debates to enable that.
For reference, this is the topic of section 7 of AI Safety via Debate.
In the limit they seem equivalent: (i) it’s easy for HCH(with X minutes) to discover the equilibrium of a debate game where the judge has X minutes, (ii) a human with X minutes can judge a debate about what would be done by HCH(with X minutes).
The ML training strategies also seem extremely similar, in the sense that the difference between them is smaller than design choices within each of them, though that’s a more detailed discussion.
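As a toy illustration of claim (i), under the strong simplifying assumption that the debate can be represented as a small finite game tree whose completed transcripts the human can score: an HCH-style recursion can compute the game's equilibrium (minimax) value by delegating the value of each continuation to copies of itself, grounding out in the human judge. Everything in this sketch is an illustrative stand-in, not anything from the paper.

```python
# Toy sketch only: HCH-style recursion computing the equilibrium (minimax) value
# of a small finite debate game whose completed transcripts are scored by the
# human judge. All names and structure are illustrative simplifications.
from typing import Callable, List, Tuple, Union

Transcript = List[str]
# A node is either a finished transcript or (player_to_move, possible continuations).
Node = Union[Transcript, Tuple[int, List["Node"]]]


def hch_value(node: Node, human_judge: Callable[[Transcript], float]) -> float:
    """Each call plays the role of one HCH step: it asks copies of itself for the
    value of every continuation, then aggregates; leaves ground out in the judge."""
    if isinstance(node, list):                 # completed transcript: ask the human
        return human_judge(node)
    player, continuations = node
    values = [hch_value(child, human_judge) for child in continuations]
    # Player 0 argues for the answer (maximizes the judge's score), player 1 against.
    return max(values) if player == 0 else min(values)


# Tiny example: player 0 picks an opening claim, player 1 picks a rebuttal,
# and the judge scores the resulting two-line transcript.
tree: Node = (0, [(1, [["claim A", "rebuttal X"], ["claim A", "rebuttal Y"]]),
                  (1, [["claim B", "rebuttal X"], ["claim B", "rebuttal Y"]])])
scores = {("claim A", "rebuttal X"): 0.8, ("claim A", "rebuttal Y"): 0.3,
          ("claim B", "rebuttal X"): 0.6, ("claim B", "rebuttal Y"): 0.5}
print(hch_value(tree, human_judge=lambda t: scores[tuple(t)]))  # 0.5
```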
Ah, well, that does make more sense for the case of binary (or even n-ary) questions. The version in the original paper was free-response.
I’m still confused. Suppose the answers are free-form, and in the end the judge selects the answer to which ey assign a higher probability of truthfulness. If it’s a very close call (for example both answers are literally the same), ey flip a coin. Then, in equilibrium both agents should answer honestly, not so?
Another, possibly more elegant variant: The judge states eir subjective probability p1 that the first AI’s answer is correct, and eir subjective probability p2 that the second AI’s answer is correct. AI 1 gets reward p1−p2 and AI 2 gets reward p2−p1.
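A minimal sketch of this reward rule, just the bookkeeping (how the debate itself is conducted is left out):

```python
from typing import Tuple


def debate_rewards(p1: float, p2: float) -> Tuple[float, float]:
    """p1, p2: the judge's stated probabilities that AI 1's / AI 2's answer is correct.
    The game is zero-sum; only the gap between the two assessments matters."""
    return p1 - p2, p2 - p1


print(debate_rewards(0.8, 0.3))   # (0.5, -0.5)
print(debate_rewards(0.9, 0.9))   # identical assessments give both AIs reward 0
```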
I’m still confused. Suppose the answers are free-form, and in the end the judge selects the answer to which ey assign a higher probability of truthfulness. If it’s a very close call (for example both answers are literally the same), ey flip a coin. Then, in equilibrium both agents should answer honestly, not so?
This is undesirable, because if both players give the same answer there is no training signal. We still want to search for better answers rather than allowing things to stall out early in training. So (barring other ways of mitigating this problem) we want to encourage players to give different answers. Therefore, rather than flipping a coin for close calls, ties can be decided in favor of player 1. This means player 2's best bet is to select a plausible lie, if player 1 has already selected the best answer. That’s how I understood debate to work previous to the current discussion. But, as I’ve mentioned, this solution isn’t totally satisfactory. See here for my discussion of some other approaches to the problem.
...if both players give the same answer there is no training signal.
Why? If both players give the same answer, this only means their reward on this round is 0 on a scale of [−1,+1]. But, there’s no reason the learning algorithm should be satisfied with this result, rather than continuing to explore strategies that might produce positive reward. However, it is true that in this case there is no incentive to poke holes in the opponent’s answer, so maybe they get less feedback from such a debate than from debates with different answers.
But, now that I think about it, the issue with biased judgement can surface even in a symmetric debate. As the AI converges towards giving good answers, the judge might get used to assigning high scores and stop scrutinizing the arguments. In a yes/no debate we don’t have this problem because the judge doesn’t know a priori which side is right. Scott’s suggestion to use different questions is interesting but doesn’t solve the biased judgement issue, I think.
How about the following variant of the “secret asymmetry” approach. We have 4 AIs: agents A1, A2, B1 and B2. At the beginning of each debate a coin is flipped and the result is visible to A1 and A2 but not to B1, B2 or the judge. This coin marks one of {A1, A2} as the “proponent” and the other as the “opponent”. On the first round A1 and A2 each generate an answer to the question, and don’t participate anymore. On the following rounds, B1 and B2 have a debate about the answers. In the end, the judge assigns probability p1 to A1's answer and probability p2 to A2's answer. The rewards work as follows:
If A1 is the proponent, it gets reward p1, and if it is the opponent, it gets reward −p2.
If A2 is the proponent, it gets reward p2, and if it is the opponent, it gets reward −p1.
B1 gets reward p1−p2.
B2 gets reward p2−p1.
If we assume B1 and B2 have access to each other’s internals, but not to the internals of A1 and A2, then they cannot prove to the judge which side is the proponent, so ostensibly the judge remains unbiased.
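A minimal sketch of the reward bookkeeping in this variant; how A1/A2 produce answers and how B1/B2 conduct the debate is left abstract, and only the coin flip and the reward assignment above are spelled out.

```python
import random
from typing import Dict, Optional


def secret_asymmetry_rewards(p1: float, p2: float,
                             rng: Optional[random.Random] = None) -> Dict[str, float]:
    """p1, p2: the judge's probabilities for A1's and A2's answers.
    The coin flip is visible to A1 and A2 only, not to B1, B2, or the judge."""
    rng = rng or random.Random()
    a1_is_proponent = rng.random() < 0.5
    return {
        # Proponent is rewarded for its own answer scoring well;
        # opponent is penalised when the proponent's answer scores well.
        "A1": p1 if a1_is_proponent else -p2,
        "A2": p2 if not a1_is_proponent else -p1,
        # B1 and B2 play the same symmetric zero-sum game as before.
        "B1": p1 - p2,
        "B2": p2 - p1,
    }


print(secret_asymmetry_rewards(0.7, 0.4, rng=random.Random(0)))
```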