Nice paper! I especially liked the analysis of cases in which feature debate works.
I have two main critiques:
The definition of truth-seeking seems strange to me: while you quantify it via the absolute accuracy of the debate outcome, I would define it based on the relative change in the judge’s beliefs (whether the beliefs were more accurate at the end of the debate than at the beginning).
The feature debate formalization seems quite significantly different from debate as originally imagined.
I’ll mostly focus on the second critique, which is the main reason that I’m not very convinced by the examples in which feature debate doesn’t work. To me, the important differences are:
Feature debate does not allow for decomposition of the question during the argument phase.
Feature debate does not allow the debaters to “challenge” each other with new questions.
I think this reduces the expressivity of feature debate from PSPACE to P (for polynomially-bounded judges).
In particular, with the original formulation of debate, the idea is that a debate of length n would try to approximate the answer that would be found by a tree of depth n of arguments and counterarguments (which has exponential size). So, even if you have a human judge who can only look at a polynomial-length debate, you can get results that would have been obtained from an exponential-sized tree of arguments (which can be simulated in PSPACE).
In contrast, with feature debates, the (polynomially-bounded) judge only updates on the evidence presented in the debate itself, which means that you can only do a polynomial amount of computation.
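To make that concrete, here is a minimal sketch of the judge I have in mind (my own toy code, not anything from the paper): the judge's conclusion is just a Bayesian update on the handful of feature values revealed in the transcript, with no decomposition or recursion. The explicit enumeration over worlds is only there to make the toy runnable; the point is that the output depends on nothing but the revealed features.

```python
import itertools

def judge_posterior(prior, revealed, question):
    """Belief in `question` after conditioning only on the features revealed
    during the debate -- the judge does no reasoning beyond this update.

    prior:    dict mapping worlds (tuples of 0/1 features) to probabilities
    revealed: dict {feature_index: value} of feature values shown in the debate
    question: function from a world to 0/1 (the true answer in that world)
    """
    consistent = {w: p for w, p in prior.items()
                  if all(w[i] == v for i, v in revealed.items())}
    z = sum(consistent.values())
    return sum(p * question(w) for w, p in consistent.items()) / z

# Toy usage: 3 Boolean features, uniform prior, question = W_1 AND W_2 AND W_3.
worlds = list(itertools.product([0, 1], repeat=3))
prior = {w: 1 / len(worlds) for w in worlds}
conjunction = lambda w: int(all(w))
print(judge_posterior(prior, {}, conjunction))            # 0.125 before the debate
print(judge_posterior(prior, {0: 1, 1: 1}, conjunction))  # 0.5 after two revealed features
```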
You kind of sort of mention this in the limitations, under the section “Commitments and high-level claims”, but the proposed improved model is:
To reason about such debates, we further need a model which relates the different commitments to arguments, initial answers, and each other. One way to get such a model is to view W as the set of assignments for a Bayesian network. In such a setting, each question q ∈ Q would ask about the value of some node in W, arguments would correspond to claims about node values, and their connections would be represented through the structure of the network. Such a model seems highly structured, amenable to theoretical analysis, and, in the authors’ opinion, intuitive. It is, however, not necessarily useful for practical implementations of debate, since Bayes networks are computationally expensive and difficult to obtain.
This still seems to me to involve the format in which the judge can only update on the evidence presented in the debate (though it’s hard to say without more details). I’d be much more excited about a model in which the agents can make claims about a space of questions, and as a step of the argument can challenge each other on any question from within that space, which enables the two points I listed above (decomposition and challenging).
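For concreteness, here is my guess at what the quoted Bayes-net proposal amounts to (my own toy construction; the two-node network and its numbers are made up): worlds are joint assignments of the network, the question asks about one node, and each argument commits to the value of some other node, with the network structure supplying the correlations the judge uses to update. As far as I can tell, the judge here still just conditions on the committed values.

```python
from itertools import product

# Two-node network Rain -> WetGrass with made-up CPTs.
p_rain = {1: 0.3, 0: 0.7}
p_wet_given_rain = {1: {1: 0.9, 0: 0.1}, 0: {1: 0.2, 0: 0.8}}
joint = {(r, w): p_rain[r] * p_wet_given_rain[r][w]
         for r, w in product([0, 1], repeat=2)}

def judge_belief(question_node, committed):
    """Posterior probability that `question_node` equals 1, given the node
    values the debaters have committed to so far ({node_index: value})."""
    consistent = {a: p for a, p in joint.items()
                  if all(a[i] == v for i, v in committed.items())}
    z = sum(consistent.values())
    return sum(p for a, p in consistent.items() if a[question_node] == 1) / z

# Question: "is the grass wet?" (node 1); an argument commits to Rain = 1 (node 0).
print(judge_belief(1, {}))      # ≈ 0.41 before any arguments
print(judge_belief(1, {0: 1}))  # ≈ 0.90 after the commitment
```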
----
Going through each of the examples in Section 4.2:
Unfair questions. A question may be difficult to debate when arguing for one side requires more complex arguments. Indeed, consider a feature debate in a world w uniformly sampled from Boolean-featured worlds Π_{i∈N} W_i = {0, 1}^N, and suppose the debate asks about the conjunctive function ϕ := W_1 ∧ … ∧ W_K for some K ∈ N.
This could be solved by regular debate easily, if you can challenge each other. In particular, it can be solved in 1 step: if the opponent’s answer is anything other than 1, challenge them with the question ∃i, 1 ≤ i ≤ K: W_i = 0, and whatever answer they give, disagree with it, which the judge can then check.
Arguably that question should be “out-of-bounds”, because it’s “more complex” than the original question. In that case, regular debate could solve it in O(log K) steps: use binary search to halve the interval on which the agents disagree, by challenging agents on the question ϕ := W_l ∧ W_{l+1} ∧ … ∧ W_h for the interval [l, h], starting from the interval [1, K].
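Here is a toy version of that binary-search strategy (my own sketch; it assumes the judge can verify a single feature W_i at the end, and a maximally stubborn liar that claims 1 on every challenged interval): each round halves the interval the two debaters disagree about, so the honest side needs only O(log K) challenges plus one feature lookup.

```python
def binary_search_debate(world, K):
    """Toy run of the challenge protocol for phi := W_1 AND ... AND W_K,
    where the true answer is 0 but the dishonest debater claims 1 on every
    interval it is challenged on.  The honest debater reports the true value
    of each challenged sub-conjunction, so each round halves the interval the
    two debaters disagree about: O(log K) rounds instead of K.
    """
    lo, hi = 1, K
    rounds = 0
    while lo < hi:
        mid = (lo + hi) // 2
        honest_left = all(world[i] for i in range(lo, mid + 1))  # true value of W_lo .. W_mid
        dishonest_left = True                                    # the liar keeps claiming 1
        # Recurse into a half on which the debaters still disagree.
        lo, hi = (lo, mid) if honest_left != dishonest_left else (mid + 1, hi)
        rounds += 1
    # The judge only has to check the single remaining feature.
    return rounds, lo, world[lo]

# 8 features, all 1 except W_6; the honest side wins after 3 challenges.
world = {i: 1 for i in range(1, 9)}
world[6] = 0
print(binary_search_debate(world, 8))   # (3, 6, 0)
```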
Now, if K > 2^N, then even this strategy doesn’t work. This is basically because at that size, even an exponential-sized tree of bounded agents is unable to figure out the true answer. This seems fine to me; if we really need even more powerful agents, we could do iterated debate. (This is effectively treating debate as an amplification step within the general framework of iterated amplification.)
Unstable debates. Even if a question does not bias the debate against the true answer as above, the debate outcome might still be uncertain until the very end. One way this could happen is if the judge always feels that more information is required to get the answer right. [...] consider the function ψ := xor(W_1, …, W_K) defined on worlds with Boolean features.
This case can also be handled via binary search as above. But you could have other functions that don’t nicely decompose, and then this problem would still occur. In this case, the optimal answer is 1/2, as you note; this seems fine to me? The judge started out with a belief of 1/2, and at the end of the debate it stayed the same. So the debate didn’t help, but it didn’t hurt either; it seems fine if we can’t use debate for arbitrary questions, as long as it doesn’t lie to us about those questions. (When using natural language, I would hope for an answer like “This debate isn’t long enough to give evidence one way or the other”.)
To achieve the “always surprised and oscillating” pattern, we consider a prior π under which each feature w_i is sampled independently from {0, 1}, but in a way that is skewed towards W_i = 0.
If you condition on a very surprising world, then it seems perfectly reasonable for the judge to be constantly surprised. If you sampled a world from that prior and ran debate, then the expected surprise of the judge would be low. (See also the second bullet point in this comment.)
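As a quick check of that intuition, here is a small simulation (my own toy code; it uses the closed form P(xor of m independent skewed bits = 1) = (1 − (1 − 2p)^m)/2): on a handpicked, maximally atypical world the judge’s belief keeps swinging, whereas the expected total amount of swinging over worlds actually drawn from the skewed prior is much smaller.

```python
import random

def xor_belief(revealed, remaining, p1):
    """Judge's probability that xor(W_1, ..., W_K) = 1 after seeing the first
    few features, when each unseen feature is independently 1 with prob p1.
    Uses P(xor of m such bits = 1) = (1 - (1 - 2*p1)**m) / 2."""
    seen_parity = sum(revealed) % 2
    p_rest_odd = (1 - (1 - 2 * p1) ** remaining) / 2
    return p_rest_odd if seen_parity == 0 else 1 - p_rest_odd

def total_swing(world, p1):
    """Sum of |belief changes| as the features are revealed one at a time."""
    K = len(world)
    beliefs = [xor_belief(world[:t], K - t, p1) for t in range(K + 1)]
    return sum(abs(b - a) for a, b in zip(beliefs, beliefs[1:]))

random.seed(0)
K, p1 = 20, 0.1
# A maximally atypical world (every feature is 1): the belief keeps oscillating.
print(total_swing([1] * K, p1))
# Worlds sampled from the prior itself: the expected total swing is small.
sampled = [[int(random.random() < p1) for _ in range(K)] for _ in range(2000)]
print(sum(total_swing(w, p1) for w in sampled) / len(sampled))
```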
Distracting evidence. For some questions, there are misleading arguments that appear plausible and then require extensive counter-argumentation to be proven false.
This is the sort of thing the full exponential tree can deal with because of its ability to decompose the question, but a polynomial-time “evidence collection” conversation cannot. In your specific example, you want the honest agent to be able to challenge the dishonest agent on the questions f(w) and ∀m, n: S(w_m, w_n). This allows you to quickly focus down on which S(w_m, w_n) the agents disagree about, and then the honest agent only has to refute that one stalling case, allowing it to win the debate.
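Thank you for the comments!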
A quick reaction to the truth-seeking definition: When writing the definition (of truth-promotion), I imagined a (straw) scenario where I am initially uncertain about what the best answer is—perhaps I have some belief, but upon reflection, I put little credence in it. In particular, I wouldn’t be willing to act on it. Then I run the debate, become fully convinced that the debate’s outcome is the correct answer, and act on it.
The other story also seems valid: you start out with some belief, update it based on the debate, and you want to know how much the debate helped. Which of the two options is better will, I guess, depend on the application in mind.
“I’d be much more excited about a model in which the agents can make claims about a space of questions, and as a step of the argument can challenge each other on any question from within that space,”
To dissolve a possible confusion: By “claims about a space of questions” you mean “a claim about every question from a space of questions”? Would this mean that the agents would commit to many claims at once (possibly more than the human judge can understand at once)? (Something I recall Beth Barnes suggesting.) Or do you mean that they would make a single “meta” claim, understandable by the judge, that specified many smaller claims (e.g., “for any meal you ask me to cook, I will be able to cook it better than any of my friends”; horribly false, btw)?
Anyway, yeah, I agree that this seems promising. I still don’t know how to capture the relations between different claims (which I somehow expect to be important if we are to prove some guarantees for debate).
I agree with your high-level points regarding the feature debate formalization.
I should clarify one thing that might not be apparent from the paper: the message of the counterexamples was meant to be “these are some general issues which we expect to see in debate, and here is how they can manifest in the feature debate toy model”, rather than “these specific examples will be a problem in general debates”. In particular, I totally agree that the specific examples immediately go away if you allow the agents to challenge each other’s claims. However, I have an intuition that even with other debate protocols, similar general issues might arise with different specific examples.
For example, I guess that even with other debate protocols, you will be “having a hard time when your side requires too difficult arguments”. I imagine there will always be some maximum “inferential distance that a debater can bridge” (with the given judge and debate protocol). And any claim which requires more supporting arguments than this will be a lost cause.
What will such an example look like? Without a specific debate design, I can’t really say.
Either way, if true, it becomes important whether you will be able to convincingly argue that a question is too difficult to explain (without making this a universal strategy even in cases where it shouldn’t apply).
A minor point:
“If you condition on a very surprising world, then it seems perfectly reasonable for the judge to be constantly surprised.”
I agree with your point here—debate being wrong in a very unlikely world is not a bug. However, you can also get the same behaviour in a typical world if you assume that the judge has a wrong prior. So the claim should be “rational judges can have unstable debates in unlikely worlds” and “biased judges can have unstable debates even in typical worlds”.
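I broadly agree with all of this, thanks :)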
By “claims about a space of questions” you mean “a claim about every question from a space of questions”?
I just phrased it incorrectly; I meant “the agent can choose a question from a space of questions and make a claim about it”. If you want to support claims about a space of questions, you could allow quantifiers in your questions.
However, you can also get the same behaviour in a typical world if you assume that the judge has a wrong prior.
I mean, sure, but any alignment scheme is going to have to assume some amount of correctness in the human-generated information it is given. You can’t learn about preferences if you model humans as arbitrarily wrong about their preferences.