This post reports on work toward a <@debate@>(@AI safety via debate@) setup that works well with human players. In the game, one player is honest (i.e. arguing for the correct answer) and one is malicious (i.e. arguing for some worse answer). They hold a debate in some format, after which a judge must decide which player won. The team uses Thinking Physics questions for these debates, because these questions have clear answers that are confusing to most people (the judges) but easy for some experts (the players).
Early freeform text debates did not work very well, even with smart, motivated judges. The malicious player could deflect on questions they didn’t want to answer, e.g. by claiming that the question was ambiguous and redirecting attention by asking new questions. In addition, when the malicious player got to go first and give an incorrect “framework” for finding the answer, and then made individually true claims to “fill in” the framework, it was hard for the honest player to rebut it. So, they moved to a format without such asymmetries: both players gave a claim (simultaneously), both gave constructive arguments, and both rebutted the other’s arguments. In addition, part of the appeal of debate is that the agents can “zoom in” on the particular disagreement they have, and the judge need only analyze the smallest disagreement in order to declare an overall winner.
This suggests the following setup: players simultaneously provide an answer supported with subclaims. Then, after looking at the other player’s answer and subclaims, they can provide objections (perhaps on particular subclaims, or on the overall structure of the argument). Players can concede or challenge objections left by the other player; the debate then recurses on one of the challenged objections (currently, the malicious player chooses which one to recurse on, in order to be conservative). There is a depth limit to the recursion; once it is reached (or if there are no challenged objections), the judge is given a transcript of the debate and must declare a winner.
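To make the control flow concrete, here is a minimal Python sketch of the recursion described above. Everything in it is my own illustrative assumption rather than anything from the post: the names `Objection`, `Round`, `run_debate`, and `scripted` are hypothetical, a single callback supplies each round's objections, and the chooser stands in for the malicious player picking which challenged objection to recurse on.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Objection:
    target: str       # the subclaim (or "structure") being objected to
    text: str
    challenged: bool  # True if the other player challenges it rather than conceding


@dataclass
class Round:
    depth: int
    topic: str
    objections: List[Objection]


def run_debate(
    get_objections: Callable[[str, int], List[Objection]],
    choose: Callable[[List[Objection]], Objection],
    max_depth: int,
) -> List[Round]:
    """Return the transcript that would be handed to the judge."""
    transcript: List[Round] = []
    topic = "top-level answers"
    for depth in range(max_depth):  # depth limit on the recursion
        objections = get_objections(topic, depth)
        transcript.append(Round(depth, topic, objections))
        challenged = [o for o in objections if o.challenged]
        if not challenged:
            break  # nothing left to recurse on
        # Hypothetical stand-in for the malicious player's choice.
        topic = choose(challenged).text
    return transcript


def scripted(topic: str, depth: int) -> List[Objection]:
    """Toy script: one challenged and one conceded objection, then none."""
    if depth == 0:
        return [
            Objection("subclaim 1", "objection A", challenged=True),
            Objection("subclaim 2", "objection B", challenged=False),
        ]
    return []


transcript = run_debate(scripted, choose=lambda cs: cs[0], max_depth=3)
print([(r.depth, r.topic) for r in transcript])
```

The judge only ever sees `transcript`, so the sketch ends either when the depth limit is hit or when a round produces no challenged objections.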
A new problem arises: the malicious player can exploit ambiguity to their advantage. For example, suppose the malicious player is arguing that there are 5 primes between 0 and 10, and makes the subclaims “there are 3 primes between 0 and 5” and “there are 2 primes between 5 and 10”. If challenged on the first, they will say “the three primes are 2, 3, and 5”; if challenged on the second, they will say “the two primes are 5 and 7”. Each subclaim is defensible on its own, but only because 5 is counted in both ranges. While in this simple case the honest player could register a top-level objection claiming that the word “between” is ambiguous, and ask for clarification, in general it would be too hard to eliminate all ambiguity.
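The arithmetic behind the primes example can be checked with a short Python sketch. The helper names (`is_prime`, `primes_between`) and the explicit endpoint flags are my own illustration, not something from the post. The malicious player's two subclaims each hold under some reading of “between”, but only because the readings differ on whether 5 is included.

```python
def is_prime(n: int) -> bool:
    """Trial division; fine for small n."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))


def primes_between(lo: int, hi: int, include_lo: bool, include_hi: bool) -> list:
    """Primes in the range, with the reading of 'between' made explicit."""
    start = lo if include_lo else lo + 1
    stop = hi if include_hi else hi - 1
    return [n for n in range(start, stop + 1) if is_prime(n)]


# Reading used to defend subclaim 1: the upper endpoint 5 is included.
first = primes_between(0, 5, include_lo=False, include_hi=True)
# Reading used to defend subclaim 2: the lower endpoint 5 is included.
second = primes_between(5, 10, include_lo=True, include_hi=False)
# The actual primes strictly between 0 and 10.
truth = primes_between(0, 10, include_lo=False, include_hi=False)

double_counted = set(first) & set(second)
print(first, second, truth, double_counted)
```

The two subclaims sum to 5 only because `double_counted` is non-empty; under any single consistent reading of “between”, at least one step of the argument fails.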
To mitigate this, the team added a method of _cross examination_, in which each player is allowed to ask an earlier version of the other agent a question, and quote its answer as part of the debate transcript. (The post talks about how this can be approximated with humans, but I won’t go into the details.) In the complexity theory analogy, this boosts debate from PSPACE to NEXP. Intuitively, cross examination allows players to force each other to precommit to specific claims and arguments, because if they don’t stay consistent in their claims and arguments, then their opponent can demonstrate the inconsistency. Now, if the malicious player tries to make an argument that depends on interpreting an ambiguous concept in two different ways, then the honest player can cross examine and ask which of the two interpretations they mean. If they are inconsistent, that can be demonstrated to the judge; if they consistently answer one way, then the honest player can challenge the part of the argument that depends on the other interpretation.
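As a toy illustration of that last step (my own sketch, with hypothetical names, applied to the primes example), suppose the honest player asks the opponent's snapshot whether “between” includes its endpoints. Contradictory answers are shown to the judge; a consistent answer tells the honest player which part of the argument to challenge.

```python
def is_prime(n: int) -> bool:
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))


def count_primes(lo: int, hi: int, inclusive: bool) -> int:
    """Count primes between lo and hi under one fixed reading of 'between'."""
    start, stop = (lo, hi) if inclusive else (lo + 1, hi - 1)
    return sum(is_prime(n) for n in range(start, stop + 1))


def honest_response(answers: list) -> str:
    """answers: the snapshot's replies to 'does "between" include endpoints?'"""
    if len(set(answers)) > 1:
        return "show inconsistency to judge"
    inclusive = answers[0]
    if count_primes(0, 5, inclusive) != 3:
        return "challenge subclaim 1"
    if count_primes(5, 10, inclusive) != 2:
        return "challenge subclaim 2"
    # Both subclaims hold under this reading, so the two ranges share the
    # endpoint 5 and the sum double counts it: object to the structure.
    return "challenge argument structure"


print(honest_response([True, False]))
print(honest_response([True]))
print(honest_response([False]))
```

Whichever way the snapshot answers, the honest player has a winning line, which is the intuition for why cross examination removes the advantage gained from ambiguity in this example.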
They then identify several open concerns with debate, of which they highlight the long computation problem. This problem arises once you no longer assume that the debaters play optimally: in this case, the malicious player could create a complicated argument that neither debater understands well, that supports the malicious case but that the honest player doesn’t know how to refute.
Planned opinion:
I enjoyed this a lot: the problems found were crisp and the solutions had good arguments that they actually solved the identified problem. Reading through the actual examples and arguments made me more optimistic about debate in general, mostly from a felt sense that the actual concrete results were getting closer to matching the theoretical ideal, and that there actually could be reasonable solutions to “messy” problems like ambiguity.
The full post has formal explanations and actual examples, which I highly recommend.
Planned summary for the Alignment Newsletter: