I’ve usually seen the truthful equilibrium (ie, the desired result of training) described as one where the first player always gives the real answer, and the second player has to lie.
That seems weird; why would we do that? I always thought of it as: there is a yes/no question, agent 1 is arguing for “yes”, agent 2 is arguing for “no”.
However, the problem is that debate is supposed to allow justification trees which are larger than can possibly be explained to the human, but which make sense to a human at every step.
I didn’t realize you make this assumption. I agree that it makes things much more iffy (I’m somewhat skeptical about “factored cognition”). But debate can be useful without this assumption as well. We can imagine an AI answering questions for which the answer can be fully explained to a human, but which is still superintelligent because it comes up with those answers much faster than a human, or even all of humanity put together. In this case, I would still worry that, scaled up indefinitely, it can lead to AIs hacking humans in weird ways. But plausibly there is a middle region (that we can access by quantilization?) where they are strong enough to be superhuman and to lie in “conventional” ways (which would be countered by the debate opponent), but too weak for weird hacking. And, in any case, combining this idea with other alignment mechanisms can lead to something useful (e.g. I suggested using it in Dialogic RL).
That seems weird; why would we do that? I always thought of it as: there is a yes/no question, agent 1 is arguing for “yes”, agent 2 is arguing for “no”.
Ah, well, that does make more sense for the case of binary (or even n-ary) questions. The version in the original paper was free-response.
If answers are pre-assigned like that, then my issues with the honest judging strategy are greatly reduced. However, it’s no longer meaningful to speak of a truth-telling equilibrium; instead, the question seems to be whether false claims are typically (convincingly) uncovered as false given enough debate time.
I didn’t realize you make this assumption. I agree that it makes things much more iffy (I’m somewhat skeptical about “factored cognition”).
Yeah, I’ve heard (through the grapevine) that Paul and Geoffrey Irving think debate and factored cognition are tightly connected. It didn’t occur to me to try and disentangle them. I do feel a lot better about your version.
It harnesses the power of search to find arguments which convince humans but which humans couldn’t have found.
It harnesses the adversarial game to find counterarguments, as a safeguard against manipulative/misleading arguments.
It harnesses the same safeguard recursively, to prevent manipulative counterargument, counter-counterargument, etc. Under some assumptions about the effectiveness of the safeguard, this would ensure non-manipulation.
None of this requires anything about factored cognition, or arguments bigger than a human can understand. If one believed in factored cognition, some version of HCH could be used to judge the debates to enable that.
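For reference, this is the topic of section 7 of AI Safety via Debate.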
In the limit they seem equivalent: (i) it’s easy for HCH(with X minutes) to discover the equilibrium of a debate game where the judge has X minutes, (ii) a human with X minutes can judge a debate about what would be done by HCH(with X minutes).
The ML training strategies also seem extremely similar, in the sense that the difference between them is smaller than design choices within each of them, though that’s a more detailed discussion.
Ah, well, that does make more sense for the case of binary (or even n-ary) questions. The version in the original paper was free-response.
I’m still confused. Suppose the answers are free-form, and in the end the judge selects the answer to which ey assigns a higher probability of truthfulness. If it’s a very close call (for example, both answers are literally the same), ey flips a coin. Then, in equilibrium, both agents should answer honestly, not so?
Another, possibly more elegant variant: The judge states eir subjective probability p1 that the first AI’s answer is correct, and eir subjective probability p2 that the second AI’s answer is correct. AI 1 gets reward p1−p2 and AI 2 gets reward p2−p1.
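For concreteness, here is a minimal sketch of the two rules just described (in Python, with hypothetical function names that aren’t taken from the thread); the judge’s stated probabilities are simply passed in as numbers:

```python
import random

def coin_flip_selection(p1, p2, tolerance=1e-9):
    """First proposal: the judge selects the answer ey assigns the higher
    probability of truthfulness, flipping a fair coin on a very close call."""
    if abs(p1 - p2) < tolerance:   # e.g. both answers are literally the same
        return random.choice([1, 2])
    return 1 if p1 > p2 else 2

def probability_difference_rewards(p1, p2):
    """Second proposal: the judge states subjective probabilities p1 and p2
    that each AI's answer is correct; each AI is rewarded by the difference."""
    return p1 - p2, p2 - p1        # (reward for AI 1, reward for AI 2)
```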
I’m still confused. Suppose the answers are free-form, and in the end the judge selects the answer to which ey assigns a higher probability of truthfulness. If it’s a very close call (for example, both answers are literally the same), ey flips a coin. Then, in equilibrium, both agents should answer honestly, not so?
This is undesirable, because if both players give the same answer there is no training signal. We still want to search for better answers rather than allowing things to stall out early in training. So (barring other ways of mitigating this problem) we want to encourage players to give different answers. Therefore, rather than flipping a coin for close calls, ties can be decided in favor of player 1. This means player 2’s best bet is to select a plausible lie, if player 1 has already selected the best answer. That’s how I understood debate to work prior to the current discussion. But, as I’ve mentioned, this solution isn’t totally satisfactory. See here for my discussion of some other approaches to the problem.
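For contrast with the coin-flip rule, here is a minimal sketch of the tie-breaking rule described here (again Python with a hypothetical name; the zero-sum ±1 payoff is an assumption, chosen to match the [−1,+1] range mentioned in the reply below):

```python
def tie_to_player_1_rewards(p1, p2):
    """Ties (including exact copies of player 1's answer) are decided in favor
    of player 1 rather than by a coin flip, so copying the best answer does not
    pay for player 2. Assumes a zero-sum +1 / -1 payoff for winning / losing."""
    player_1_wins = p1 >= p2       # >= means ties go to player 1
    return (1, -1) if player_1_wins else (-1, 1)
```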
...if both players give the same answer there is no training signal.
Why? If both players give the same answer, this only means their reward on this round is 0, out of a possible range of [−1,+1]. But there’s no reason the learning algorithm should be satisfied with this result, rather than continuing to explore strategies that might produce positive reward. However, it is true that in this case there is no incentive to poke holes in the opponent’s answer, so maybe they get less feedback from such a debate than from debates with different answers.
But, now that I think about it, the issue with biased judgement can surface even in a symmetric debate. As the AI converges towards giving good answers, the judge might get used to assigning high scores and stop scrutinizing the arguments. In a yes/no debate we don’t have this problem because the judge doesn’t know a priori which side is right. Scott’s suggestion to use different questions is interesting but doesn’t solve the biased judgement issue, I think.
How about the following variant of the “secret asymmetry” approach? We have 4 AIs: agents A1, A2, B1 and B2. At the beginning of each debate a coin is flipped and the result is visible to A1 and A2, but not to B1, B2 or the judge. This coin marks one of {A1, A2} as the “proponent” and the other as the “opponent”. On the first round A1 and A2 each generate an answer to the question, and don’t participate anymore. On the following rounds, B1 and B2 have a debate about the answers. In the end, the judge assigns probability p1 to A1’s answer and probability p2 to A2’s answer. The rewards work as follows:
If A1 is the proponent, it gets reward p1, and if it is the opponent, it gets reward −p2.
If A2 is the proponent, it gets reward p2, and if it is the opponent, it gets reward −p1.
B1 gets reward p1−p2.
B2 gets reward p2−p1.
If we assume B1 and B2 have access to each other’s internals, but not to the internals of A1 and A2, then they cannot prove to the judge which side is the proponent, so ostensibly the judge remains unbiased.
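A minimal sketch of this reward assignment (Python, hypothetical name; the judge’s probabilities p1 and p2 are taken as given, and the secret coin flip is simulated directly):

```python
import random

def secret_asymmetry_rewards(p1, p2):
    """Rewards for the four-agent variant: p1 and p2 are the judge's stated
    probabilities for A1's and A2's answers respectively."""
    a1_is_proponent = random.random() < 0.5   # the coin flip, seen only by A1 and A2

    if a1_is_proponent:
        # the proponent gets its answer's credibility; the opponent gets minus that
        reward_a1, reward_a2 = p1, -p1
    else:
        reward_a1, reward_a2 = -p2, p2

    # B1 and B2 debate the two answers; their rewards ignore the coin flip
    reward_b1 = p1 - p2
    reward_b2 = p2 - p1

    return {"A1": reward_a1, "A2": reward_a2, "B1": reward_b1, "B2": reward_b2}
```

Since B1, B2 and the judge never see the coin flip, and the B rewards do not depend on it, the game looks symmetric from the judge’s point of view, which is what ostensibly keeps the judge unbiased.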