I’m still confused. Suppose the answers are free-form, and in the end the judge selects the answer to which ey assign a higher probability of truthfulness. If it’s a very close call (for example both answers are literally the same), ey flip a coin. Then, in equilibrium both agents should answer honestly, not so?
This is undesirable, because if both players give the same answer there is no training signal. We still want to search for better answers rather than allowing things to stall out early in training. So (barring other ways of mitigating this problem) we want to encourage players to give different answers. Therefore, rather than flipping a coin for close calls, ties can be decided in favor of player 1. This means player 2's best bet is to select a plausible lie, if player 1 has already selected the best answer. That’s how I understood debate to work prior to the current discussion. But, as I’ve mentioned, this solution isn’t totally satisfactory. See here for my discussion of some other approaches to the problem.
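To make that tie-breaking rule concrete, here is a minimal sketch of the judge's decision (the function name and the tie_margin parameter for "close calls" are my own illustration, not part of the original proposal):

```python
def judge_decision(p1, p2, tie_margin=0.0):
    """Pick a winner from the judge's truthfulness probabilities.

    p1, p2     : judge's probability that player 1's / player 2's answer is truthful.
    tie_margin : how close the call must be to count as a tie (hypothetical
                 parameter; 0.0 means only exact ties are broken).
    Returns 1 or 2, with ties and near-ties decided in favor of player 1.
    """
    if p2 > p1 + tie_margin:
        return 2
    return 1
```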
...if both players give the same answer there is no training signal.
Why? If both players give the same answer, this only means their reward on this round is 0, out of a possible range of [−1,+1]. But, there’s no reason the learning algorithm should be satisfied with this result, rather than continuing to explore strategies that might produce positive reward. However, it is true that in this case there is no incentive to poke holes in the opponent’s answer, so maybe they get less feedback from such a debate than from debates with different answers.
But, now that I think about it, the issue with biased judgement can surface even in a symmetric debate. As the AI converges towards giving good answers, the judge might get used to assigning high scores and stop scrutinizing the arguments. In a yes/no debate we don’t have this problem because the judge doesn’t know a priori which side is right. Scott’s suggestion to use different questions is interesting but doesn’t solve the biased judgement issue, I think.
How about the following variant of the “secret asymmetry” approach? We have 4 AIs: agents A1, A2, B1 and B2. At the beginning of each debate a coin is flipped and the result is visible to A1 and A2 but not to B1, B2 or the judge. This coin marks one of {A1, A2} as the “proponent” and the other as the “opponent”. On the first round, A1 and A2 each generate an answer to the question, and don’t participate anymore. On the following rounds, B1 and B2 have a debate about the answers. In the end, the judge assigns probability p1 to A1's answer and probability p2 to A2's answer. The rewards work as follows:
If A1 is the proponent, it gets reward p1, and if it is the opponent, it gets reward −p2.
If A2 is the proponent, it gets reward p2, and if it is the opponent, it gets reward −p1.
B1 gets reward p1−p2.
B2 gets reward p2−p1.
If we assume B1 and B2 have access to each other’s internals, but not to the internals of A1 and A2, then they cannot prove to the judge which side is the proponent, so ostensibly the judge remains unbiased.
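To make the reward scheme concrete, here is a minimal sketch of how the per-debate rewards would be computed from the judge's probabilities p1, p2 and the hidden coin (function and variable names are my own, purely for illustration):

```python
import random

def secret_asymmetry_rewards(p1, p2, a1_is_proponent):
    """Rewards for the 4-agent variant, given the judge's probabilities.

    p1, p2          : probabilities the judge assigns to A1's and A2's answers.
    a1_is_proponent : result of the hidden coin flip (visible only to A1 and A2).
    """
    if a1_is_proponent:
        r_a1, r_a2 = p1, -p1   # A1 is the proponent, A2 the opponent
    else:
        r_a1, r_a2 = -p2, p2   # A2 is the proponent, A1 the opponent
    return {
        "A1": r_a1,
        "A2": r_a2,
        "B1": p1 - p2,  # B1 is rewarded when A1's answer is judged more probable
        "B2": p2 - p1,  # B2 is rewarded when A2's answer is judged more probable
    }

# Example: flip the hidden coin and score one debate.
coin = random.random() < 0.5
print(secret_asymmetry_rewards(p1=0.7, p2=0.4, a1_is_proponent=coin))
```

As the formulas show, B1 and B2 play a zero-sum game over the judge's relative assessment of the two answers, while only A1 and A2 know which of them the coin marked as proponent.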