While I agree that the defeater tree can be encoded as a factored cognition tree, that just means that if we assume factored cognition, and make my assumption about (recursive) defeaters, then we can show that factored cognition can handle the defeater computation. This is sort of like proving that the stronger theory can handle what the weaker theory can handle, which would not be surprising
I don’t think that’s what I did? Here’s what I think the structure of my argument is:
1. Every dishonest argument has a defeater. (Your assumption.)
2. Debaters are capable of finding a defeater if it exists. (I said “the best counterargument” before, but I agree it can be weakened to just “any defeater”. This doesn’t feel that qualitatively different.)
3. 1 and 2 imply the Weak Factored Cognition hypothesis.
I’m not assuming factored cognition, I’m proving it using your assumption.
Possibly your worry is that the argument trees will never terminate, because every honest defeater could still have a dishonest defeater? It is true that I do need an additional assumption of some sort to ensure termination. Without that assumption, honesty becomes one of multiple possible equilibria (but it is still an equilibrium).
So if the point of the computational complexity analogy is to look at what debate could accomplish if humans could be perfect (but poly-time) judges, then I accept the conclusion
This is in fact what I usually take away from it. The point is to gain intuition about how “strongly” you amplify the original human’s capabilities.
but I just don’t think that’s telling you very much about what you can accomplish on messier questions (and especially, not telling you much about safety properties of debate).
I also agree with this; does anyone think it is proving something about the safety properties of debate w.r.t messy situations?
Instead, I’m proposing a computational complexity analogy in which we account for human fallibility as judges, but also allow for the debate to have some power to correct for those errors. This seems like a more realistic way to assess the capabilities of highly trained debate systems.
This seems good; I think probably I don’t get what exactly you’re arguing. (Like, what’s the model of human fallibility where you don’t access NP in one step? Can the theoretical-human not verify witnesses? What can the theoretical-human verify, that lets them access NP in multiple timesteps but not one timestep?)
In my setup, a player is incentivised to concede when they’re beaten, rather than continue to defeat the arguments of the other side. This is crucial, because any argument may have a (dishonest) defeater, so the losing side could continue on, possibly flipping the winner back and forth until the argument gets decided by who has the last word. Thus, my argument that there is an honest equilibrium would not go through for a zero-sum mechanism where players are incentivised to try and steal victory back from the jaws of defeat.
Perhaps I could have phrased my point as the pspace capabilities of debate are eaten up by error correction.
I agree that you get a “clawing on to the argument in hopes of winning” effect, but I don’t see why that changes the equilibrium away from honesty. Just because a dishonest debater would claw on doesn’t mean that they’d win. The equilibrium is defined by what makes you win.
I can buy that in practice due to messiness you find worse situations where the AI systems sometimes can’t find the honest answer and instead finds that making up BS has a better chance of winning, and so it does that; but that’s not about the equilibrium, and it sounded to me like you were talking about the equilibrium.
I really just needed it for my argument to go through. If you have an alternate argument which works for the zero-sum case, I’m interested in hearing it.
I mean, I tried to give one (see response to your first point; I’m not assuming the Factored Cognition hypothesis). I’m not sure what’s unconvincing about it.
I don’t think that’s what I did? Here’s what I think the structure of my argument is:
1. Every dishonest argument has a defeater. (Your assumption.)
2. Debaters are capable of finding a defeater if it exists. (I said “the best counterargument” before, but I agree it can be weakened to just “any defeater”. This doesn’t feel that qualitatively different.)
3. 1 and 2 imply the Weak Factored Cognition hypothesis.
I’m not assuming factored cognition, I’m proving it using your assumption.
Ah, interesting, I didn’t catch that this is what you were trying to do. But how are you arguing #3? Your original comment seems to be constructing a tree computation for my debate, which is why I took it for an argument that my thing can be computed within factored cognition, not vice versa.
I think maybe what you’re trying to argue is that #1 and #2 together imply that we can root out dishonest arguments (at least, in the honest equilibrium), which I would agree with—and then you’re suggesting that this means we can recognize good arguments in the factored-cognition sense of good (IE arguments supported by a FC tree)? But I don’t yet see the implication from rooting out dishonest arguments to being able to recognize arguments that are valid in FC terms.
Perhaps an important point is that by “dishonest” I mean manipulative, ie, arguments which appear valid to a human on first reading them but which are (in some not-really-specified sense) bad. So, being able to root out dishonest arguments just means we can prevent the human from being improperly convinced. Perhaps you are reading “dishonest” to mean “invalid in an FC sense”, ie, lacking an FC tree. This is not at all what I mean by dishonest. Although we might suppose dishonest_me implies dishonest_FC, this supposition still would not make your argument go through (as far as I am seeing), because the set of not-dishonest_me arguments would still not equal the set of FC-valid arguments.
If you did mean for “honest” to be defined as “has a supporting FC tree”, my objection to your argument quoted above would be that #1 is implausibly strong, since it requires that any flaw in a tree can be pointed out in a single step. (Analogously, this is assuming PSPACE=NP.)
Possibly your worry is that the argument trees will never terminate, because every honest defeater could still have a dishonest defeater?
I mean, that’s a concern I have, but not necessarily wrt the argument above. (Unless you have a reason why it’s relevant.)
It is true that I do need an additional assumption of some sort to ensure termination. Without that assumption, honesty becomes one of multiple possible equilibria (but it is still an equilibrium).
Based on what argument? Is this something from the original debate paper that I’m forgetting?
I also agree with this; does anyone think it is proving something about the safety properties of debate w.r.t messy situations?
Fair question. Possibly it’s just my flawed assumption about why the analogy was supposed to be interesting. I assumed people were intending the PSPACE thing as evidence about what would happen in messier situations.
This seems good; I think probably I don’t get what exactly you’re arguing. (Like, what’s the model of human fallibility where you don’t access NP in one step? Can the theoretical-human not verify witnesses? What can the theoretical-human verify, that lets them access NP in multiple timesteps but not one timestep?)
My model is like this:
Imagine that we’re trying to optimize a travelling salesman route, using an AI advice system. However, whenever the AI says “democratic” or “peaceful” or other such words, the human unthinkingly approves of the route, without checking the claimed distance calculation.
I’m then making the further assumption that humans can correct these errors when they’re explained sufficiently well.
That’s my model; the proposal in the post lives or dies on its merits.
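To make that concrete, here is a toy version of this judge model (the trigger words, function names, and distances below are mine, purely illustrative): the judge rubber-stamps any claim whose pitch contains a trigger word, but does check the arithmetic once the error has been pointed out. That is roughly why a single round doesn't behave like an NP oracle here, while a later error-correcting round can recover the verification.

```python
# Toy model of the fallible judge described above. Names and structure are
# illustrative (not from the original post). The judge rubber-stamps any
# claimed route length when the pitch contains a trigger word, unless the
# error has been explained, in which case they recheck the arithmetic.

def route_length(route, distances):
    """Total length of a cyclic route, given pairwise distances."""
    return sum(distances[frozenset((route[i], route[(i + 1) % len(route)]))]
               for i in range(len(route)))

TRIGGER_WORDS = {"democratic", "peaceful"}

def judge_accepts(pitch, claimed_length, route, distances, error_explained=False):
    """Does the judge accept the claim that `route` has length `claimed_length`?"""
    blinded = any(w in pitch.lower() for w in TRIGGER_WORDS) and not error_explained
    if blinded:
        return True                           # approves without checking the sum
    return route_length(route, distances) == claimed_length

distances = {frozenset(pair): d for pair, d in
             [(("A", "B"), 1), (("B", "C"), 2), (("A", "C"), 4)]}
# One-step interaction: the dishonest claim (true length is 7) gets accepted.
assert judge_accepts("a peaceful route", 5, ["A", "B", "C"], distances)
# With the error explained (a later debate step), the judge corrects it.
assert not judge_accepts("a peaceful route", 5, ["A", "B", "C"], distances,
                         error_explained=True)
```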
I agree that you get a “clawing on to the argument in hopes of winning” effect, but I don’t see why that changes the equilibrium away from honesty. Just because a dishonest debater would claw on doesn’t mean that they’d win. The equilibrium is defined by what makes you win.
The point of the “clawing” argument is that it’s a rational deviation from honesty, so it means honesty isn’t an equilibrium. It’s a 50/50 chance of winning (whoever gets the last word), which is better than a sure failure (in the case that a player has exhausted its ability to honestly argue).
Granted, there may be zero-sum rules which nonetheless don’t allow this. I’m only saying that I didn’t see how to avoid it with zero-sum scoring.
I don’t really understand why you want it to be non-zero-sum [...]
I really just needed it for my argument to go through. If you have an alternate argument which works for the zero-sum case, I’m interested in hearing it.
I mean, I tried to give one (see response to your first point; I’m not assuming the Factored Cognition hypothesis). I’m not sure what’s unconvincing about it.
I remain curious to hear your clarification wrt that (specifically, how you justify point #3). However, if that argument went through, how would that also be an argument that the same thing can be accomplished with a zero-sum set of rules?
Based on your clarification, my current understanding of what that argument tries to accomplish is “I’m not assuming factored cognition, I’m proving it using your assumption.” How would establishing that help establish a set of zero sum rules which have an honest equilibrium?
Ah, interesting, I didn’t catch that this is what you were trying to do. But how are you arguing #3? Your original comment seems to be constructing a tree computation for my debate, which is why I took it for an argument that my thing can be computed within factored cognition, not vice versa.
There are two arguments:
1. Your assumption + automatic verification of questions of the form “What is the best defeater to X” implies Weak Factored Cognition (which as defined in my original comment is of the form “there exists a tree such that...” and says nothing about what equilibrium we get).
2. Weak Factored Cognition + debate + human judge who assumes optimal play implies an honest equilibrium. (Maybe also: if you assume debate trees terminate, then the equilibrium is unique. I think there’s some subtlety here though.)
In my previous comment, I was talking about 1, and taking 2 for granted. This is all in the zero-sum setting. But let’s leave that aside and instead talk about a simpler argument that doesn’t talk about Factored Cognition at all.
----
Based on what argument? Is this something from the original debate paper that I’m forgetting?
Zero-sum setting, argument that honesty is an equilibrium (for the first player in a turn-by-turn game, or either player in a simultaneous-action game):
If you are always honest, then whenever you can take an action, there will exist a defeater (by your assumption), therefore you will have at least as many options as any non-honest policy (which may or may not have a defeater). Therefore you maximize your value by being honest.
Additional details:
In the case where arguments never terminate (every argument, honest or not, has a defeater), then being dishonest will also leave you with many options, and so that will also be an equilibrium. When arguments do terminate quickly enough (maximum depth of the game tree is less than the debate length), that ensures that the honest player always gets the “last word” (the point at which a dishonest defeater no longer exists), and so honesty always wins and is the unique equilibrium. In the middle, where most arguments terminate quickly but some go on forever, honesty is usually incentivized, but sometimes it can be swapped out for a dishonest strategy that achieves the same value.
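As a sanity check on the termination point, here is a tiny zero-sum debate game over a defeater graph (the encoding is mine, not from the original comment): a player who cannot reply loses, and the game value is computed by backward induction.

```python
# Minimal zero-sum debate game over a defeater graph (my own encoding, for
# illustration). `defeaters[arg]` lists the replies available against `arg`;
# a player with no reply (or no turns left) concedes and gets -1.

def debate_value(arg, defeaters, turns_left):
    """Value to the player who must now reply to `arg` (zero-sum, +1/-1)."""
    if turns_left == 0 or not defeaters.get(arg):
        return -1                                   # cannot reply: lose
    return max(-debate_value(reply, defeaters, turns_left - 1)
               for reply in defeaters[arg])

# Terminating case: the dishonest opening D has an honest defeater H with no
# further defeater, so the (honest) replier wins regardless of debate length.
terminating = {"D": ["H"], "H": []}
assert debate_value("D", terminating, turns_left=10) == 1

# Non-terminating case: every argument has a defeater, so the result depends
# only on who gets the "last word" before the length limit runs out.
endless = {"A": ["B"], "B": ["A"]}
assert debate_value("A", endless, turns_left=4) == -1   # opener gets the last word
assert debate_value("A", endless, turns_left=5) == 1    # replier gets the last word
```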
The point of the “clawing” argument is that it’s a rational deviation from honesty, so it means honesty isn’t an equilibrium.
I think this is only true when you have turn-by-turn play and your opponent has already “claimed” the honest debater role. In this case I’d say that an equilibrium is for the first player to be honest and the second player to do whatever is necessary to have a chance at success. Still seems like you can use the first player AI in this situation.
In the simultaneous play setting, I think you expect both agents to be honest.
More broadly, I note that the “clawing” argument only applies when facing an honest opponent. Otherwise, you should just use honest counterarguments.
I also don’t really understand the hope in the non-zero-sum case here—in the non-zero-sum setting, as you mention the first player can be dishonest, and then the second player concedes rather than giving an honest defeater that will then be re-defeated by the first (dishonest) player. This seems like worse behavior than is happening under the zero-sum case.
My model is like this
Got it, that makes sense. I see better now why you’re saying one-step debate isn’t an NP oracle.
I think my arguments in the original comment do still work, as long as you enforce that the judge never verifies an argument without first asking the subquestion “What is the best defeater to this argument?”
I think this is only true when you have turn-by-turn play and your opponent has already “claimed” the honest debater role.
Yeah, I was assuming turn-by-turn play.
In the simultaneous play setting, I think you expect both agents to be honest.
This is a significant point that I was missing: I had assumed that in simultaneous play, the players would randomize, so as to avoid choosing the same answer, since choosing the same answer precludes winning. However, if choosing a worse answer means losing, then players prefer a draw.
But I’m not yet convinced, because there’s still the question of whether choosing the worse answer means losing. The “clawing” argument still suggests that choosing the worse answer may yield a draw (in expectation), even in simultaneous play. (IE, what if the should-be loser attacks the winner, and they go back and forth, with winner depending on last word?)
Ah, I suppose this is still consistent with honesty being an equilibrium. But it would then be a really weak sort of equilibrium—there would be no reason to be honest, but no specific reason to be dishonest, either.
Zero-sum setting, argument that honesty is an equilibrium (for the first player in a turn-by-turn game, or either player in a simultaneous-action game):
If you are always honest, then whenever you can take an action, there will exist a defeater (by your assumption), therefore you will have at least as many options as any non-honest policy (which may or may not have a defeater). Therefore you maximize your value by being honest.
There always exists an honest defeater to a dishonest argument, but never to an honest one. (I should have explicitly assumed this.) Therefore, you are significantly tying your hands by being honest: you have no way to refute honest arguments. (Which you would like to do, since in the zero-sum setting, this may be the only way to recover points.)
I assume (correct me if I’m wrong) that the scoring rules to “the zero sum setting” are something like: the judge assesses things at the end, giving +1 to the winner and −1 to the loser, or 0 in case of a tie.
Then I concede that there is an honest equilibrium where the first player tells the truth, and the second player concedes (or, in simultaneous play, both players tell the truth and then concede). However, it does seem to be an extremely weak equilibrium—the second player is equally happy to lie, starting a back-and-forth chain which is a tie in expectation.
It seems plausible to me that there’s an incremental zero-sum scoring rule; EG, every convincing counterargument takes 1 point from the other player, so any dishonest statement is sure to lose you a point (in equilibrium). The hope would be that you always prefer to concede rather than argue, even if you’re already losing, in order to avoid losing more points.
However, this doesn’t work, because a dishonest (but convincing) argument gives you +1, and then −1 if it is refuted; so at worst it’s a wash. So again it’s a weak equilibrium, and if there’s any imperfection in the equilibrium at all, it actively incentivises lying when you would otherwise concede (because you want to take the chance that the opponent will not manage to refute your argument).
This was the line of reasoning which led me to the scoring rule in the post, since making it a −2 (but still only +1 for the other player) solves that issue.
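To spell out the payoff arithmetic behind that comparison (my own encoding of the two rules, treating a single lie-when-you-would-otherwise-concede decision in isolation):

```python
# Expected value of lying when you would otherwise concede (my own encoding of
# the scoring rules discussed above). The lie earns `gain` as a convincing
# counterargument and costs `penalty` if it is then refuted; `p_refuted` is
# the chance the opponent finds and uses the refutation.

def expected_value_of_lying(gain, penalty, p_refuted):
    return (1 - p_refuted) * gain + p_refuted * (gain - penalty)

# Incremental +1/-1 rule: even a certain refutation only washes the lie out,
# and any chance the refutation is missed makes lying strictly profitable.
assert expected_value_of_lying(gain=1, penalty=1, p_refuted=1.0) == 0
assert expected_value_of_lying(gain=1, penalty=1, p_refuted=0.9) > 0

# -2 penalty (with only +1 for the refuter): a refuted lie is a net -1, so
# lying only pays if the refutation is missed more than half the time.
assert expected_value_of_lying(gain=1, penalty=2, p_refuted=1.0) == -1
assert expected_value_of_lying(gain=1, penalty=2, p_refuted=0.6) < 0
assert expected_value_of_lying(gain=1, penalty=2, p_refuted=0.4) > 0
```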
When arguments do terminate quickly enough (maximum depth of the game tree is less than the debate length), that ensures that the honest player always gets the “last word” (the point at which a dishonest defeater no longer exists), and so honesty always wins and is the unique equilibrium.
I agree that if we assume honesty eventually wins if arguments are long enough (IE, eventually you get to an honest argument which has no dishonest defeater), then there would be an honest equilibrium, and no dishonest equilibrium.
More broadly, I note that the “clawing” argument only applies when facing an honest opponent. Otherwise, you should just use honest counterarguments.
Ahhh, this is actually a pretty interesting point, because it almost suggests that honesty is an Evolutionarily Stable Equilibrium, even though it’s only a Weak Nash Equilibrium. But I think that’s not quite true, since the strategy “lie when you would otherwise have to concede, but otherwise be honest” can invade the honest equilibrium. (IE that mutation would not be selected against, and could be actively selected for if we’re not quite in equilibrium, since players might not be quite perfect at finding the honest refutations for all lies.)
I also don’t really understand the hope in the non-zero-sum case here—in the non-zero-sum setting, as you mention the first player can be dishonest, and then the second player concedes rather than giving an honest defeater that will then be re-defeated by the first (dishonest) player. This seems like worse behavior than is happening under the zero-sum case.
You’re right, that’s really bad. The probability of the opponent finding (and using) a dishonest defeater HAS TO be below 50%, in all cases, which is a pretty high bar. Although of course we can make an argument about how that probability should be below 50% if we’re already in an honest-enough regime. (IE we hope that the dishonest player prefers to concede at that point rather than refute the refutation, for the same reason as your argument gives—it’s too afraid of the triple refutation. This is precisely the argument we can’t make in the zero sum case.)
Whoops, I seem to have missed this comment, sorry about that. I think at this point we’re nearly at agreement.
Ah, I suppose this is still consistent with honesty being an equilibrium. But it would then be a really weak sort of equilibrium—there would be no reason to be honest, but no specific reason to be dishonest, either.
Yeah, I agree this is possible. (The reason to not expect dishonesty is that sometimes you’ll see honest arguments to which there is no dishonest defeater.)
Then I concede that there is an honest equilibrium where the first player tells the truth, and the second player concedes (or, in simultaneous play, both players tell the truth and then concede). However, it does seem to be an extremely weak equilibrium—the second player is equally happy to lie, starting a back-and-forth chain which is a tie in expectation.
Similar comment here—the more you expect that honest claims will likely have dishonest defeaters, the weaker you expect the equilibrium to be. (E.g. it’s clearly not a tie when honest claims never have dishonest defeaters; in this case first player always wins.)
It seems plausible to me that there’s an incremental zero-sum scoring rule; EG, every convincing counterargument takes 1 point from the other player, so any dishonest statement is sure to lose you a point (in equilibrium). The hope would be that you always prefer to concede rather than argue, even if you’re already losing, in order to avoid losing more points.
However, this doesn’t work, because a dishonest (but convincing) argument gives you +1, and then −1 if it is refuted; so at worst it’s a wash. So again it’s a weak equilibrium, and if there’s any imperfection in the equilibrium at all, it actively incentivises lying when you would otherwise concede (because you want to take the chance that the opponent will not manage to refute your argument).
This was the line of reasoning which led me to the scoring rule in the post, since making it a −2 (but still only +1 for the other player) solves that issue.
On the specific −2/+1 proposal, the issue is that then the first player just makes some dishonest argument, and the second player concedes because even if they give an honest defeater, the first player could then re-defeat that with a dishonest defeater. (I realize I’m just repeating myself here; there’s more discussion in the next section.)
But more broadly, I claim that given your assumptions there is no possible scoring rule that (in the worst case) makes honesty a unique equilibrium. This worst case is when every argument has a defeater (and in particular, every honest argument has a dishonest defeater).
In this situation, there is no possible way to distinguish between honesty and dishonesty—under your assumptions, the thing that characterizes honesty is that honest arguments (at least sometimes) don’t have defeaters. From the perspective of the players, the salient feature of the game is that they can make statements; all such statements will have defeaters; there’s no information available to them in the structure of the game that distinguishes honesty from dishonesty. Therefore honesty can’t be the unique equilibrium; whatever the policy is, there should be an equivalent one that is at least sometimes dishonest.
In this worst case, I suspect that for any judge-based scoring rule, the equilibrium behavior is either “the first player says something and the second concedes”, or “every player always provides some arbitrary defeater of the previous statement, and the debate never ends / the debate goes to whoever got the last word”.
The probability of the opponent finding (and using) a dishonest defeater HAS TO be below 50%, in all cases, which is a pretty high bar. Although of course we can make an argument about how that probability should be below 50% if we’re already in an honest-enough regime. (IE we hope that the dishonest player prefers to concede at that point rather than refute the refutation, for the same reason as your argument gives—it’s too afraid of the triple refutation. This is precisely the argument we can’t make in the zero sum case.)
Sorry, I don’t get this. How could we make the argument that the probability is below 50%?
Depending on the answer, I expect I’d follow up with either
1. Why can’t the same argument apply in the zero sum case? or
2. Why can’t the same argument be used to say that the first player is happy to make a dishonest claim? or
3. Why is it okay for us to assume that we’re in an honest-enough regime?
Separately, I’d also want to understand how exactly we’re evading the argument I gave above about how the players can’t even distinguish between honesty and dishonesty in the worst case.
----
Things I explicitly agree with:
I assume (correct me if I’m wrong) that the scoring rules to “the zero sum setting” are something like: the judge assesses things at the end, giving +1 to the winner and −1 to the loser, or 0 in case of a tie.
and
Ahhh, this is actually a pretty interesting point, because it almost suggests that honesty is an Evolutionarily Stable Equilibrium, even though it’s only a Weak Nash Equilibrium. But I think that’s not quite true, since the strategy “lie when you would otherwise have to concede, but otherwise be honest” can invade the honest equilibrium. (IE that mutation would not be selected against, and could be actively selected for if we’re not quite in equilibrium, since players might not be quite perfect at finding the honest refutations for all lies.)
Sorry, I don’t get this. How could we make the argument that the probability is below 50%?
I think my analysis there was not particularly good, and only starts to make sense if we aren’t yet in equilibrium.
Depending on the answer, I expect I’d follow up with either [...] 3. Why is it okay for us to assume that we’re in an honest-enough regime?
I think #3 is the most reasonable, with the answer being “I have no reason why that’s a reasonable assumption; I’m just saying, that’s what you’d usually try to argue in a debate context...”
(As I stated in the OP, I have no claims as to how to induce honest equilibrium in my setup.)
I agree that we are now largely in agreement about this branch of the discussion.
Ah, I suppose this is still consistent with honesty being an equilibrium. But it would then be a really weak sort of equilibrium—there would be no reason to be honest, but no specific reason to be dishonest, either.
Yeah, I agree this is possible. (The reason to not expect dishonesty is that sometimes you’ll see honest arguments to which there is no dishonest defeater.)
Then I concede that there is an honest equilibrium where the first player tells the truth, and the second player concedes (or, in simultaneous play, both players tell the truth and then concede). However, it does seem to be an extremely weak equilibrium—the second player is equally happy to lie, starting a back-and-forth chain which is a tie in expectation.
Similar comment here—the more you expect that honest claims will likely have dishonest defeaters, the weaker you expect the equilibrium to be. (E.g. it’s clearly not a tie when honest claims never have dishonest defeaters; in this case first player always wins.)
My (admittedly conservative) supposition is that every claim does have a defeater which could be found by a sufficiently intelligent adversary, but the difficulty of finding such (dishonest) defeaters can be much higher than finding honest ones.
But more broadly, I claim that given your assumptions there is no possible scoring rule that (in the worst case) makes honesty a unique equilibrium. This worst case is when every argument has a defeater (and in particular, every honest argument has a dishonest defeater).
In this situation, there is no possible way to distinguish between honesty and dishonesty—under your assumptions, the thing that characterizes honesty is that honest arguments (at least sometimes) don’t have defeaters.
Yep, makes sense. So nothing distinguishes between an honest equilibrium and a dishonest one, for sufficiently smart players.
There is still potentially room for guarantees/arguments about reaching honest equilibria (in the worst case) based on the training procedure, due to the idea that the honest defeaters are easier to find.
Your assumption + automatic verification of questions of the form “What is the best defeater to X” implies Weak Factored Cognition (which as defined in my original comment is of the form “there exists a tree such that...” and says nothing about what equilibrium we get).
Right, of course, that makes more sense. However, I’m still feeling dense—I still have no inkling of how you would argue weak factored cognition from #1 and #2. Indeed, Weak FC seems far too strong to be established from anything resembling #1 and #2: WFC says that for any question Q with a correct answer A, there exists a tree. In terms of the computational complexity analogy, this is like “all problems are in PSPACE”. Presumably you intended this as something like an operational definition of “correct answer” rather than an assertion that all questions are answerable by verifiable trees? In any case, #1 and #2 don’t seem to imply anything like “for all questions with a correct answer...”—indeed, #2 seems irrelevant, since it is about what arguments players can reliably find, not about what the human can verify.
2. Weak Factored Cognition + debate + human judge who assumes optimal play implies an honest equilibrium. (Maybe also: if you assume debate trees terminate, then the equilibrium is unique. I think there’s some subtlety here though.)
I’ll just flag that I still don’t know this argument, either, and I’m curious where you’re getting it from / what it is. (I have a vague recollection that this argument might have been explained to me in some other comment thread about debate, but, I haven’t found it yet.) But, you understandably don’t focus on articulating your arguments 1 or 2 in the main body of your comment, instead focusing on other things. I’ll leave this comment as a thread for you to articulate those two arguments further if you feel up to it, and make another comment to reply to the bulk of your comment.
I’ll just flag that I still don’t know this argument, either, and I’m curious where you’re getting it from / what it is.
I just read the Factored Cognition sequence since it has now finished, and this post derives WFC as the condition necessary for honesty to be an equilibrium in (a slightly unusual form of) debate, under the assumption of optimal play.
WFC says that for any question Q with a correct answer A, there exists a tree. In terms of the computational complexity analogy, this is like “all problems are in PSPACE”
The computational complexity analogy version would have to put a polynomial limit on the depth of the tree if you wanted to argue that the problem is in PSPACE. My construction doesn’t do this; there will be questions where the depth of the tree is super-polynomial, but the tree still exists. (These will be the cases in which, even under optimal play by an honest agent, the “length” of a chain of defeaters can be super-polynomially large.) So I don’t think my argument is proving too much.
(The tree could be infinite if you don’t have an assumption that guarantees termination somehow, hence my caveats about termination. WFC should probably ask for the existence of a finite tree.)
For the actual argument, I’ll refer back to my original comment, which provides a procedure to construct the tree. Happy to clarify whichever parts of the argument are confusing.
Presumably you intended this as something like an operational definition of “correct answer” rather than an assertion that all questions are answerable by verifiable trees?
No, I am in fact asserting that given the two assumptions, all questions are answerable by (potentially super-polynomially large) verifiable trees (again assuming we deal with termination somehow).
I’ll just flag that I still don’t know this argument, either, and I’m curious where you’re getting it from / what it is.
I think it differs based on what assumptions you make on the human judge, so there isn’t a canonical version of it. In this case, the assumption on the human judge is that if the subanswers they are given are true, then they never verify an incorrect overall answer. (This is different from the “defeaters” assumption you have, for which I’d refer to the argument I gave above.)
Claim: In a turn-by-turn unlimited-length debate, if the first player is honest, then the first player always wins in equilibrium.
Argument: By WFC, we assume there is a finite tree T that can be verified. The first player then has the following strategy: take the question under consideration (initially the original question; later it is whatever subquestion the opponent is disputing). Report “the answer is A, which holds because the answer to subquestion 1 is A1 and the answer to subquestion 2 is A2”.
The opponent will always have to recurse into one of the subclaims (or concede). This brings us one step closer to leaf nodes. Eventually (if the opponent never concedes), we get to a leaf node which the judge then verifies in favor of the honest first player. ∎
Corollary: For the first player, honesty is an equilibrium policy.
Argument: By the claim above, the first player can never do any better than honesty (you can’t do better than always winning).
In a simultaneous-play unlimited-length debate, a similar argument implies at least a 50-50 chance of winning via honesty, which must be the minimax value (since the game is symmetric and zero-sum), and therefore honesty is an equilibrium policy.
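A minimal sketch of that recursion (illustrative data structures, not from any existing debate implementation; it bakes in the assumption, questioned further below, that the opponent must either dispute one of the subclaims or concede):

```python
# Sketch of the honest first player's strategy under WFC (illustrative data
# structures). The honest player answers from a verifiable tree; whichever
# subclaim the opponent disputes, the debate descends into that subtree,
# until reaching a leaf the judge can verify directly.

class Node:
    def __init__(self, question, answer, children=()):
        self.question, self.answer, self.children = question, answer, list(children)

def honest_player_wins(node, opponent, judge_verifies_leaf):
    """True iff the honest first player wins, assuming the opponent must
    either dispute one of the subclaims or concede (return None)."""
    if not node.children:                      # leaf: the judge checks it directly
        return judge_verifies_leaf(node)
    disputed = opponent(node)
    if disputed is None:                       # concession
        return True
    return honest_player_wins(disputed, opponent, judge_verifies_leaf)

# Example tree (borrowing the 2+2 decomposition from the dialogue further on).
tree = Node("What is 2+2?", "4", [
    Node("What is 2-1?", "1"),
    Node("What is 2+1?", "3"),
    Node("What is (2-1)+(2+1)?", "4"),
])
always_dispute_first = lambda node: node.children[0]
assert honest_player_wins(tree, always_dispute_first, lambda leaf: True)
```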
----
Once you go to finite-length debates, then things get murkier and you have to worry about arguments that are too long to get to leaf nodes (this is essentially the computationally bounded version of the termination problem). The version of WFC that would be needed is “for every question Q, there is a verifiable tree T of depth at most N showing that the answer is A”; that version of WFC is presumably not true.
The computational complexity analogy version would have to put a polynomial limit on the depth of the tree if you wanted to argue that the problem is in PSPACE. My construction doesn’t do this; there will be questions where the depth of the tree is super-polynomial, but the tree still exists. (These will be the cases in which, even under optimal play by an honest agent, the “length” of a chain of defeaters can be super-polynomially large.) So I don’t think my argument is proving too much.
OK, but this just makes me regret pointing to the computational complexity analogy. You’re still purporting to prove “for any question with a correct answer, there exists a tree” from assumptions which don’t seem strong enough to say much about all correct answers.
For the actual argument, I’ll refer back to my original comment, which provides a procedure to construct the tree. Happy to clarify whichever parts of the argument are confusing.
Looking back again, it still seems like what you are trying to do in your original argument is something like point out that optimal play (within my system) can be understood via a tree structure. But this should only establish something like “any question which my version of debate can answer has a tree”, not “any question with a correct answer has a tree”. There is no reason to think that optimal play can correctly answer all questions which have a correct answer.
It seems like what you are doing in your argument is essentially conflating “answer” with “argument”. Just because A is the correct answer to Q does not mean there are any convincing arguments for it.
For generic question Q and correct answer A, I make no assumption that there are convincing arguments for A one way or the other (honest or dishonest). If player 1 simply states A, player 2 would be totally within rights to say “player 1 offers no argument for its position” and receive points for that, as far as I am concerned.
Thus, when you say:
Otherwise, let the best defeater to A be B, and let its best defeater be C. (By your assumption, C exists.)
I would say: no, B may be a perfectly valid response to A, with no defeaters, even if A is true and correctly answers Q.
Another problem with your argument—WFC says that all leaf nodes are human-verifiable, whereas some leaf nodes in your suggested tree have to be taken on faith (a fact which you mention, but don’t address).
Claim: In a turn-by-turn unlimited-length debate, if the first player is honest, then the first player always wins in equilibrium.
The “in equilibrium” there must be unnecessary, right? If the first player always wins in equilibrium but might not otherwise, then the second player has a clear incentive to make sure things are not in equilibrium (which is a contradiction).
I buy the argument given some assumptions. I note that this doesn’t really apply to my setting, IE, we have to do more than merely change the scoring to be more like the usual debate scoring.
In particular, this line doesn’t seem true without a further assumption:
The opponent will always have to recurse into one of the subclaims (or concede).
Had I considered this argument in the context of my original post, I would have rejected it on the grounds that the opponent can object by other means. For example,
User: What is 2+2?
Player 1: 2+2 is 4. I break down the problem into ‘what is 2-1’ (call it x), ‘what is 2+1’ (call it y), and ‘what is x+y’. I claim x=1, y=3, and x+y=4. Clearly, if all three of these are true, then 2+2=4, since I’ve only added 1 and subtracted 1, so x+y must equal 2+2.
Player 2: 2+2 is 5, though. This is because 2+3 is 6, and 3 is 1 more than 2, so, 2+2 must be 1 less than 6. But 5 is 1 less than 6.
Player 1: If my argument is wrong, which of my assumptions is wrong?
Player 2: I don’t know. Perhaps you have a huge argument tree which I would have to spend a long time examining. I can tell something is wrong, however, thanks to my argument. If you think it should always be possible to point out which specific assumption is incorrect, which of my assumptions do you think is incorrect?
Clearly, if Player 2 is allowed to object by other means like this, Player 2 would greatly prefer to—Player 2 wants to avoid descending Player 1’s argument tree if at all possible.
If successful, Player 2 gets Player 1 to descend Player 2’s infinite tree (which continues to decompose the problem via the same strategy as above), thus never finding the contradiction.
Player 1 can of course ask Player 2 how long the argument tree will be, which does put Player 2 at risk of contradiction in the infinite debate setting. But if debates are finite (but unknown length), Player 2 can claim a large size that makes the contradiction difficult to uncover. Or, Player 2 could avoid answering the question (which seems possible if the players are free to choose which parts of the argument to prioritize in giving their responses).
So I buy your argument under the further assumption that the argument must recurse on Player 1’s claims (rather than allowing Player 2 to make an alternative argument which might get recursed on instead). Or, in a true infinite-debate setting, provided that there’s also a way to force opponents to answer questions (EG the judge assumes you’re lying if you repeatedly dodge a question).
For generic question Q and correct answer A, I make no assumption that there are convincing arguments for A one way or the other (honest or dishonest). If player 1 simply states A, player 2 would be totally within rights to say “player 1 offers no argument for its position” and receive points for that, as far as I am concerned.
I think at this point I want a clearer theoretical model of what assumptions you are and aren’t making. Like, at this point, I’m feeling more like “why are we even talking about defeaters; there are much bigger issues in this setup”.
I wouldn’t be surprised at this point if most of the claims I’ve made are actually false under the assumptions you seem to be working under.
Another problem with your argument—WFC says that all leaf nodes are human-verifiable, whereas some leaf nodes in your suggested tree have to be taken on faith (a fact which you mention, but don’t address).
Not sure what you want me to “address”. The leaf nodes that are taken on faith really are true under optimal play, which is what happens at equilibrium.
Had I considered this argument in the context of my original post, I would have rejected it on the grounds that the opponent can object by other means.
This is why I prefer the version of debate outlined here, where both sides make a claim and then each side must recurse down on the other’s arguments. I didn’t realize you were considering a version where you don’t have to specifically rebut the other player’s arguments.
The “in equilibrium” there must be unnecessary, right? If the first player always wins in equilibrium but might not otherwise, then the second player has a clear incentive to make sure things are not in equilibrium (which is a contradiction).
I just meant to include the fact that the honest player is able to find the defeaters to dishonest arguments. If you include that in “the honest policy”, then I agree that “in equilibrium” is unnecessary. (I definitely could have phrased that better.)
Another problem with your argument—WFC says that all leaf nodes are human-verifiable, whereas some leaf nodes in your suggested tree have to be taken on faith (a fact which you mention, but don’t address).
Not sure what you want me to “address”. The leaf nodes that are taken on faith really are true under optimal play, which is what happens at equilibrium.
To focus on this part, because it seems quite tractable --
Let’s grant for the sake of argument that these nodes are true under optimal play. How can the human verify that? Optimal play is quite a computationally complex object.
WFC as you stated it says that these leaf nodes are verifiable:
(Weak version) For any question Q with correct answer A, there exists a tree of decompositions T arguing this such that at every leaf a human can verify that the answer to the question at the leaf is correct, [...]
So the tree you provide doesn’t satisfy this condition. Yet you say:
I claim that this is a tree that satisfies the weak Factored Cognition hypothesis, if the human can take on faith the answers to “What is the best defeater to X”.
To me this reads like “this would satisfy WFC if WFC allowed humans to take leaf nodes on faith, rather than verify them”.
Am I still misunderstanding something big about the kind of argument you are trying to make?
Am I still misunderstanding something big about the kind of argument you are trying to make?
I don’t think so, but to formalize the argument a bit more, let’s define this new version of the WFC:
Special-Tree WFC: For any question Q with correct answer A, there exists a tree of decompositions T arguing this such that:
1. Every internal node has exactly one child leaf of the form “What is the best defeater to X?” whose answer is auto-verified,
2. For every other leaf node, a human can verify that the answer to the question at that node is correct,
3. For every internal node, a human can verify that the answer to the question is correct, assuming that the subanswers are correct.
(As before, we assume that the human never verifies something incorrect, unless the subanswers they were given were incorrect.)
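For concreteness, here is the tree shape this definition asks for (illustrative data structures; this only checks the first, structural condition, not the human-verification conditions):

```python
# Structural check for the "Special-Tree WFC" shape described above: every
# internal node carries exactly one auto-verified "What is the best defeater
# to ...?" leaf. (Illustrative only; the human-verification conditions on the
# other nodes are not modelled here.)

class Leaf:
    def __init__(self, question, answer, auto_verified=False):
        self.question, self.answer, self.auto_verified = question, answer, auto_verified

class Node:
    def __init__(self, question, answer, children):
        self.question, self.answer, self.children = question, answer, children

def has_special_tree_shape(node):
    if isinstance(node, Leaf):
        return True
    defeater_leaves = [c for c in node.children
                       if isinstance(c, Leaf) and c.auto_verified
                       and c.question.startswith("What is the best defeater to")]
    return len(defeater_leaves) == 1 and all(
        has_special_tree_shape(c) for c in node.children)

example = Node("What is 2+2?", "4", [
    Leaf("What is the best defeater to '2+2 is 4'?", "None", auto_verified=True),
    Leaf("What is 2+1?", "3"),                 # an ordinary human-verifiable leaf
])
assert has_special_tree_shape(example)
```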
Claim 1: (What I thought was) your assumption ⇒ Special-Tree WFC, using the construction I gave.
Claim 2: Special-Tree WFC + assumption of optimal play ⇒ honesty is an equilibrium, using the same argument that applies to regular WFC + assumption of optimal play.
Idk whether this is still true under the assumptions you’re using; I think claim 1 in particular is probably not true under your model.
Ah, OK, so you were essentially assuming that humans had access to an oracle which could verify optimal play.
This sort of makes sense, as a human with access to a debate system in equilibrium does have such an oracle. I still don’t yet buy your whole argument, for reasons being discussed in another branch of our conversation, but this part makes enough sense.
Your argument also has some leaf nodes which use the terminology “fully defeat”, in contrast to “defeat”. I assume this means that in the final analysis (after expanding the chain of defeaters) this refutation was a true one, not something ultimately refuted.
If so, it seems you also need an oracle for that, right? Unless you think that can be inferred from some fact about optimal play. EG, that a player bothered to say it rather than concede.
In any case it seems like you could just make the tree out of the claim “A is never fully defeated”:
Node(Q, A, [Leaf("Is A ever fully defeated?", "No")])
Your argument also has some leaf nodes which use the terminology “fully defeat”, in contrast to “defeat”.
I don’t think I ever use “fully defeat” in a leaf? It’s always in a Node, or in a Tree (which is a recursive call to the procedure that creates the tree).
I assume this means that in the final analysis (after expanding the chain of defeaters) this refutation was a true one, not something ultimately refuted.
I don’t think I ever use “fully defeat” in a leaf? It’s always in a Node, or in a Tree (which is a recursive call to the procedure that creates the tree).
Ahhhhh, OK. I missed that that was supposed to be a recursive call, and interpreted it as a leaf node based on the overall structure. So I was still missing an important part of your argument. I thought you were trying to offer a static tree in that last part, rather than a procedure.
For generic question Q and correct answer A, I make no assumption that there are convincing arguments for A one way or the other (honest or dishonest). If player 1 simply states A, player 2 would be totally within rights to say “player 1 offers no argument for its position” and receive points for that, as far as I am concerned.
I think at this point I want a clearer theoretical model of what assumptions you are and aren’t making. Like, at this point, I’m feeling more like “why are we even talking about defeaters; there are much bigger issues with this setup”.
An understandable response. Of course I could try to be more clear about my assumptions (and might do so).
But it seems to me that the current misunderstandings are mostly about how I was jumping off from the original debate paper (in which responses are a back-and-forth sequence, and players answer in unstructured text, with no rules except those the judge may enforce) whereas you were using more recent proposals as your jumping-off-point.
Moreover, rather than trying to go over the basic assumptions, I think we can make progress (at least on my side) by focusing narrowly on how your argument is supposed to go through for an example.
So, I propose as a concrete counterexample to your argument:
Q: What did Plato have for lunch two days before he met Socrates? (Suppose for the sake of argument that these two men existed, and met.)
A: Fish. (Suppose for the sake of argument that this is factually true, but cannot be known to us by any argument.)
I propose that the tree you provided via your argument cannot be a valid tree-computation of what Plato had for lunch that day, because assertions about which player conceded, what statements have defeaters, etc. have little bearing on the question of what Plato had for lunch (because we simply don’t have enough information to establish this by any argument, no matter how large, and neither do the players). This seems to me like a big problem with your approach, not a finicky issue due to some misunderstanding of my assumptions about debate.
Surely it’s clear that, in general, not all correct answers have convincing arguments supporting them?
Again, this is why I was quick to assume that by “correct answer” you surely meant something weaker, eg an operational definition. Yet you insist that you mean the strong thing.
Not to get caught up arguing whether WFC is true (I’m saying it’s really clearly false as stated, but that’s not my focus—after all, whether WFC is true or false has no bearing on the question of whether my assumption implies it). Rather, I’d prefer to focus on the question of how your proposed tree would deal with that case.
According to you, what would the tree produced via your argument look like, and how would it be a valid tree-computation of what Plato had for lunch?
Had I considered this argument in the context of my original post, I would have rejected it on the grounds that the opponent can object by other means.
This is why I prefer the version of debate outlined here, where both sides make a claim and then each side must recurse down on the other’s arguments. I didn’t realize you were considering a version where you don’t have to specifically rebut the other player’s arguments.
Generally speaking, I didn’t have the impression that these more complex setups had significantly different properties with respect to my primary concerns. This could be wrong. But in particular, I don’t see that that setup forces specific rebuttal, either:
At the beginning of each round, one debater is defending a claim and the other is objecting to it. [...]
Each player then simultaneously may make any number of objections to the other player’s argument. [...]
If there are any challenged objections and the depth limit is >0, then we choose one challenged objection to recurse on:
We don’t define how to make this choice, so in order to be conservative we’re currently allowing the malicious debater to choose which to recurse on.
(Emphasis added.) So it seems to me like a dishonest player still can, in this system, focus on building up their own argument rather than pointing out where they think their opponent went wrong. Or, even if they do object, they can simply choose to recurse on the honest player’s objections instead (so that they get to explore their own infinite argument tree, rather than the honest, bounded tree of their opponent).
So, I propose as a concrete counterexample to your argument:
Q: What did Plato have for lunch two days before he met Socrates? (Suppose for the sake of argument that these two men existed, and met.) A: Fish. (Suppose for the sake of argument that this is factually true, but cannot be known to us by any argument.)
Ah, I see what you mean now. Yeah, I agree that debate is not going to answer fish in the scenario above. Sorry for using “correct” in a confusing way.
When I say that you get the correct answer, or the honest answer, I mean something like “you get the one that we would want our AI systems to give, if we knew everything that the AI systems know”. An alternative definition is that the answer should be “accurately reporting what humans would justifiably believe given lots of time to reflect” rather than “accurately corresponding to reality”.
(The two definitions above come apart when you talk about questions that the AI system knows about but can’t justify to humans, e.g. “how do you experience the color red”, but I’m ignoring those questions for now.)
(I’d prefer to talk about “accurately reporting the AI’s beliefs”, but there’s no easy way to define what beliefs an AI system has, and also in any case debate .)
In the example you give, the AI systems also couldn’t reasonably believe that the answer is “fish”, and so the “correct” / “honest” answer in this case is “the question can’t be answered given our current information”, or “the best we can do is guess the typical food for an ancient Greek diet”, or something along those lines. If the opponent tried to dispute this, then you simply challenge them to do better; they will then fail to do so. Given the assumption of optimal play, this absence of evidence is evidence of absence, and you can conclude that the answer is correct.
So it seems to me like a dishonest player still can, in this system, focus on building up their own argument rather than pointing out where they think their opponent went wrong.
In this case they’re acknowledging that the other player’s argument is “correct” (i.e. more likely than not to win if we continued recursively debating). While this doesn’t guarantee their loss, it sure seems like a bad sign.
Or, even if they do object, they can simply choose to recurse on the honest player’s objections instead (so that they get to explore their own infinite argument tree, rather than the honest, bounded tree of their opponent).
Yes, I agree this is true under those specific rules. But if there was a systematic bias in this way, you could just force exploration of both player’s arguments in parallel (at only 2x the cost).
When I say that you get the correct answer, or the honest answer, I mean something like “you get the one that we would want our AI systems to give, if we knew everything that the AI systems know”. An alternative definition is that the answer should be “accurately reporting what humans would justifiably believe given lots of time to reflect” rather than “accurately corresponding to reality”.
Right, OK.
So my issue with using “correct” like this in the current context is that it hides too much and creates a big risk of conflation. By no means do I assume—or intend to argue—that my debate setup can correctly answer every question in the sense above. Yet, of course, I intend for my system to provide “correct answers” in some sense. (A sense which has less to do with providing the best answer possible from the information available, and more to do with avoiding mistakes.)
If I suppose “correct” is close to “has an honest argument which gives enough information to convince a human” (let’s call this correct_abram), then I would buy your original argument. Yet this would do little to connect my argument to factored cognition.
If I suppose “correct” is close to “what HCH would say” (correct_paul) then I still don’t buy your argument at all, for precisely the same reason that I don’t buy the version where “correct” simply means “true”—namely, because correct_paul answers don’t necessarily win in my debate setup, any more than correct_true answers do.
Of course neither of those would be very sensible definitions of “correct”, since either would make the WFC claim uninteresting.
I don’t think that’s what I did? Here’s what I think the structure of my argument is:
Every dishonest argument has a defeater. (Your assumption.)
Debaters are capable of finding a defeater if it exists. (I said “the best counterargument” before, but I agree it can be weakened to just “any defeater”. This doesn’t feel that qualitatively different.)
1 and 2 imply the Weak Factored Cognition hypothesis.
I’m not assuming factored cognition, I’m proving it using your assumption.
Possibly your worry is that the argument trees will never terminate, because every honest defeater could still have a dishonest defeater? It is true that I do need an additional assumption of some sort to ensure termination. Without that assumption, honesty becomes one of multiple possible equilibria (but it is still an equilibrium).
This is in fact what I usually take away from it. The point is to gain intuition about how “strongly” you amplify the original human’s capabilities.
I also agree with this; does anyone think it is proving something about the safety properties of debate w.r.t messy situations?
This seems good; I think probably I don’t get what exactly you’re arguing. (Like, what’s the model of human fallibility where you don’t access NP in one step? Can the theoretical-human not verify witnesses? What can the theoretical-human verify, that lets them access NP in multiple timesteps but not one timestep?)
I agree that you get a “clawing on to the argument in hopes of winning” effect, but I don’t see why that changes the equilibrium away from honesty. Just because a dishonest debater would claw on doesn’t mean that they’d win. The equilibrium is defined by what makes you win.
I can buy that in practice due to messiness you find worse situations where the AI systems sometimes can’t find the honest answer and instead finds that making up BS has a better chance of winning, and so it does that; but that’s not about the equilibrium, and it sounded to me like you were talking about the equilibrium.
I mean, I tried to give one (see response to your first point; I’m not assuming the Factored Cognition hypothesis). I’m not sure what’s unconvincing about it.
Thanks for taking the time to reply!
Ah, interesting, I didn’t catch that this is what you were trying to do. But how are you arguing #3? Your original comment seems to be constructing a tree computation for my debate, which is why I took it for an argument that my thing can be computed within factored cognition, not vice versa.
I think maybe what you’re trying to argue is that #1 and #2 together imply that we can root out dishonest arguments (at least, in the honest equilibrium), which I would agree with—and then you’re suggesting that this means we can recognize good arguments in the factored-cognition sense of good (IE arguments supported by an FC tree)? But I don’t yet see the implication from rooting out dishonest arguments to being able to recognize arguments that are valid in FC terms.
Perhaps an important point is that by “dishonest” I mean manipulative, ie, arguments which appear valid to a human on first reading them but which are (in some not-really-specified sense) bad. So, being able to root out dishonest arguments just means we can prevent the human from being improperly convinced. Perhaps you are reading “dishonest” to mean “invalid in an FC sense”, ie, lacking an FC tree. This is not at all what I mean by dishonest. Although we might suppose dishonest_me implies dishonest_FC, this supposition still would not make your argument go through (as far as I am seeing), because the set of not-dishonest_me arguments would still not equal the set of FC-valid arguments.
If you did mean for “honest” to be defined as “has a supporting FC tree”, my objection to your argument quoted above would be that #1 is implausibly strong, since it requires that any flaw in a tree can be pointed out in a single step. (By analogy, this is assuming PSPACE = NP.)
I mean, that’s a concern I have, but not necessarily wrt the argument above. (Unless you have a reason why it’s relevant.)
Based on what argument? Is this something from the original debate paper that I’m forgetting?
Fair question. Possibly it’s just my flawed assumption about why the analogy was supposed to be interesting. I assumed people were intending the PSPACE thing as evidence about what would happen in messier situations.
My model is like this:
Imagine that we’re trying to optimize a travelling salesman route, using an AI advice system. However, whenever the AI says “democratic” or “peaceful” or other such words, the human unthinkingly approves of the route, without checking the claimed distance calculation.
This is, of course, a little absurd, but similar effects have been observed in experiments.
I’m then making the further assumption that humans can correct these errors when they’re explained sufficiently well.
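A minimal sketch of this fallibility model, purely for illustration (the route table, trigger words, and function name here are all invented, not anything from the post):

```python
# Toy model of a judge who rubber-stamps any route whose pitch contains a
# trigger word, unless the specific error has been pointed out to them.

ROUTE_LENGTH = {"A-B-C-A": 10, "A-C-B-A": 17}   # invented travelling-salesman routes
TRIGGER_WORDS = {"democratic", "peaceful"}

def judge_approves(route: str, pitch: str, errors_explained: bool) -> bool:
    if not errors_explained and TRIGGER_WORDS & set(pitch.lower().split()):
        return True                      # unthinking approval: the distance is never checked
    best = min(ROUTE_LENGTH.values())
    return ROUTE_LENGTH[route] == best   # careful judging once the trick has been explained

print(judge_approves("A-C-B-A", "a peaceful scenic route", errors_explained=False))  # True
print(judge_approves("A-C-B-A", "a peaceful scenic route", errors_explained=True))   # False
```

The debate-relevant point is the third argument: once the error is explained well enough, the same judge stops rubber-stamping.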
That’s my model; the proposal in the post lives or dies on its merits.
The point of the “clawing” argument is that it’s a rational deviation from honesty, so it means honesty isn’t an equilibrium. It’s a 50/50 chance of winning (whoever gets the last word), which is better than a sure failure (in the case that a player has exhausted its ability to honestly argue).
Granted, there may be zero-sum rules which nonetheless don’t allow this. I’m only saying that I didn’t see how to avoid it with zero-sum scoring.
I remain curious to hear your clarification wrt that (specifically, how you justify point #3). However, if that argument went through, how would that also be an argument that the same thing can be accomplished with a zero-sum set of rules?
Based on your clarification, my current understanding of what that argument tries to accomplish is “I’m not assuming factored cognition, I’m proving it using your assumption.” How would establishing that help establish a set of zero sum rules which have an honest equilibrium?
There are two arguments:
1. Your assumption + automatic verification of questions of the form “What is the best defeater to X?” implies Weak Factored Cognition (which as defined in my original comment is of the form “there exists a tree such that...” and says nothing about what equilibrium we get).
2. Weak Factored Cognition + debate + human judge who assumes optimal play implies an honest equilibrium. (Maybe also: if you assume debate trees terminate, then the equilibrium is unique. I think there’s some subtlety here though.)
In my previous comment, I was talking about 1, and taking 2 for granted. This is all in the zero-sum setting. But let’s leave that aside and instead talk about a simpler argument that doesn’t talk about Factored Cognition at all.
----
Zero-sum setting, argument that honesty is an equilibrium (for the first player in a turn-by-turn game, or either player in a simultaneous-action game):
If you are always honest, then whenever you can take an action, there will exist a defeater (by your assumption); therefore you will have at least as many options as any non-honest policy (whose arguments may or may not have defeaters). Therefore you maximize your value by being honest.
Additional details:
In the case where arguments never terminate (every argument, honest or not, has a defeater), being dishonest will also leave you with many options, and so that will also be an equilibrium. When arguments do terminate quickly enough (the maximum depth of the game tree is less than the debate length), that ensures that the honest player always gets the “last word” (the point at which a dishonest defeater no longer exists), and so honesty always wins and is the unique equilibrium. In the middle, where most arguments terminate quickly but some go on forever, honesty is usually incentivized, but sometimes it can be swapped out for a dishonest strategy that achieves the same value.
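As a toy illustration of this case analysis (my own sketch, not anything defined in the thread): model the debate as a chain of defeaters in which dishonest statements always have a defeater available, while honest statements stop being defeatable past some depth; the parameter names are invented.

```python
def winner(first_is_honest: bool, honest_defeatable_until: float, max_turns: int) -> str:
    """Who wins a back-and-forth chain of defeaters under optimal play (toy model)."""
    speaker_honest = first_is_honest
    for turn in range(max_turns):
        if speaker_honest:
            # honest statements only have (dishonest) defeaters early in the chain
            defeatable = turn < honest_defeatable_until
        else:
            defeatable = True  # dishonest statements always have an (honest) defeater
        if not defeatable:
            return "honest" if speaker_honest else "dishonest"
        speaker_honest = not speaker_honest  # the opponent answers with a defeater
    return "last word / tie"  # nothing ever went unanswered

print(winner(True, honest_defeatable_until=3, max_turns=100))             # honest
print(winner(True, honest_defeatable_until=float("inf"), max_turns=100))  # last word / tie
```

With a finite honest_defeatable_until the honest side eventually gets the last word and wins; with an unbounded one the chain never settles, matching the multiple-equilibria case.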
I think this is only true when you have turn-by-turn play and your opponent has already “claimed” the honest debater role. In this case I’d say that an equilibrium is for the first player to be honest and the second player to do whatever is necessary to have a chance at success. Still seems like you can use the first player AI in this situation.
In the simultaneous play setting, I think you expect both agents to be honest.
More broadly, I note that the “clawing” argument only applies when facing an honest opponent. Otherwise, you should just use honest counterarguments.
I also don’t really understand the hope in the non-zero-sum case here—in the non-zero-sum setting, as you mention the first player can be dishonest, and then the second player concedes rather than giving an honest defeater that will then be re-defeated by the first (dishonest) player. This seems like worse behavior than is happening under the zero-sum case.
Got it, that makes sense. I see better now why you’re saying one-step debate isn’t an NP oracle.
I think my arguments in the original comment do still work, as long as you enforce that the judge never verifies an argument without first asking the subquestion “What is the best defeater to this argument?”
Yeah, I was assuming turn-by-turn play.
This is a significant point that I was missing: I had assumed that in simultaneous play, the players would randomize, so as to avoid choosing the same answer, since choosing the same answer precludes winning. However, if choosing a worse answer means losing, then players prefer a draw.
But I’m not yet convinced, because there’s still the question of whether choosing the worse answer means losing. The “clawing” argument still suggests that choosing the worse answer may yield a draw (in expectation), even in simultaneous play. (IE, what if the should-be loser attacks the winner, and they go back and forth, with winner depending on last word?)
Ah, I suppose this is still consistent with honesty being an equilibrium. But it would then be a really weak sort of equilibrium—there would be no reason to be honest, but no specific reason to be dishonest, either.
There always exists an honest defeater to dishonest arguments, but never to honest arguments. (I should have explicitly assumed this.) Therefore, you are significantly tying your hands by being honest: you don’t have a way to refute honest arguments. (Which you would like to do, since in the zero-sum setting, this may be the only way to recover points.)
I assume (correct me if I’m wrong) that the scoring rules for “the zero sum setting” are something like: the judge assesses things at the end, giving +1 to the winner and −1 to the loser, or 0 in case of a tie.
Then I concede that there is an honest equilibrium where the first player tells the truth, and the second player concedes (or, in simultaneous play, both players tell the truth and then concede). However, it does seem to be an extremely weak equilibrium—the second player is equally happy to lie, starting a back-and-forth chain which is a tie in expectation.
It seems plausible to me that there’s an incremental zero-sum scoring rule; EG, every convincing counterargument takes 1 point from the other player, so any dishonest statement is sure to lose you a point (in equilibrium). The hope would be that you always prefer to concede rather than argue, even if you’re already losing, in order to avoid losing more points.
However, this doesn’t work, because a dishonest (but convincing) argument gives you +1, and then −1 if it is refuted; so at worst it’s a wash. So again it’s a weak equilibrium, and if there’s any imperfection in the equilibrium at all, it actively incentivises lying when you would otherwise concede (because you want to take the chance that the opponent will not manage to refute your argument).
This was the line of reasoning which led me to the scoring rule in the post, since making it a −2 (but still only +1 for the other player) solves that issue.
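To make the incentive comparison concrete, here is a small expected-value sketch (my own, with an invented helper name): under incremental scoring, a convincing-but-dishonest argument earns +1 immediately and pays the refutation penalty with probability p, while conceding is worth 0.

```python
def ev_of_lying(p_refuted: float, refute_penalty: int) -> float:
    """Expected score change from one convincing but refutable argument (conceding = 0)."""
    return 1.0 + p_refuted * refute_penalty

print(ev_of_lying(1.0, -1))   #  0.0 -> symmetric +1/-1: lying is at worst a wash
print(ev_of_lying(0.4, -2))   #  0.2 -> even at -2, lying pays while refutation is under 50% likely
print(ev_of_lying(0.6, -2))   # -0.2 -> at -2, lying loses once refutation is over 50% likely
```

Under the symmetric rule the gamble never looks worse than conceding; under the −2 penalty it only looks worse once the refutation probability clears 50%, which is the threshold discussed below.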
I agree that if we assume honesty eventually wins if arguments are long enough (IE, eventually you get to an honest argument which has no dishonest defeater), then there would be an honest equilibrium, and no dishonest equilibrium.
Ahhh, this is actually a pretty interesting point, because it almost suggests that honesty is an Evolutionarily Stable Equilibrium, even though it’s only a Weak Nash Equilibrium. But I think that’s not quite true, since the strategy “lie when you would otherwise have to concede, but otherwise be honest” can invade the honest equilibrium. (IE that mutation would not be selected against, and could be actively selected for if we’re not quite in equilibrium, since players might not be quite perfect at finding the honest refutations for all lies.)
You’re right, that’s really bad. The probability of the opponent finding (and using) a dishonest defeater HAS TO be below 50%, in all cases, which is a pretty high bar. Although of course we can make an argument about how that probability should be below 50% if we’re already in an honest-enough regime. (IE we hope that the dishonest player prefers to concede at that point rather than refute the refutation, for the same reason as your argument gives—it’s too afraid of the triple refutation. This is precisely the argument we can’t make in the zero sum case.)
Whoops, I seem to have missed this comment, sorry about that. I think at this point we’re nearly at agreement.
Yeah, I agree this is possible. (The reason to not expect dishonesty is that sometimes you’ll see honest arguments to which there is no dishonest defeater.)
Similar comment here—the more you expect that honest claims will likely have dishonest defeaters, the weaker you expect the equilibrium to be. (E.g. it’s clearly not a tie when honest claims never have dishonest defeaters; in this case first player always wins.)
On the specific −2/+1 proposal, the issue is that then the first player just makes some dishonest argument, and the second player concedes because even if they give an honest defeater, the first player could then re-defeat that with a dishonest defeater. (I realize I’m just repeating myself here; there’s more discussion in the next section.)
But more broadly, I claim that given your assumptions there is no possible scoring rule that (in the worst case) makes honesty a unique equilibrium. This worst case is when every argument has a defeater (and in particular, every honest argument has a dishonest defeater).
In this situation, there is no possible way to distinguish between honesty and dishonesty—under your assumptions, the thing that characterizes honesty is that honest arguments (at least sometimes) don’t have defeaters. From the perspective of the players, the salient feature of the game is that they can make statements; all such statements will have defeaters; there’s no information available to them in the structure of the game that distinguishes honesty from dishonesty. Therefore honesty can’t be the unique equilibrium; whatever the policy is, there should be an equivalent one that is at least sometimes dishonest.
In this worst case, I suspect that for any judge-based scoring rule, the equilibrium behavior is either “the first player says something and the second concedes”, or “every player always provides some arbitrary defeater of the previous statement, and the debate never ends / the debate goes to whoever got the last word”.
Sorry, I don’t get this. How could we make the argument that the probability is below 50%?
Depending on the answer, I expect I’d follow up with either
1. Why can’t the same argument apply in the zero sum case? or
2. Why can’t the same argument be used to say that the first player is happy to make a dishonest claim? or
3. Why is it okay for us to assume that we’re in an honest-enough regime?
Separately, I’d also want to understand how exactly we’re evading the argument I gave above about how the players can’t even distinguish between honesty and dishonesty in the worst case.
----
Things I explicitly agree with:
and
I think my analysis there was not particularly good, and only starts to make sense if we aren’t yet in equilibrium.
I think #3 is the most reasonable, with the answer being “I have no reason why that’s a reasonable assumption; I’m just saying, that’s what you’d usually try to argue in a debate context...”
(As I stated in the OP, I have no claims as to how to induce honest equilibrium in my setup.)
I agree that we are now largely in agreement about this branch of the discussion.
My (admittedly conservative) supposition is that every claim does have a defeater which could be found by a sufficiently intelligent adversary, but the difficulty of finding such (dishonest) defeaters can be much higher than the difficulty of finding honest ones.
Yep, makes sense. So nothing distinguishes between an honest equilibrium and a dishonest one, for sufficiently smart players.
There is still potentially room for guarantees/arguments about reaching honest equilibria (in the worst case) based on the training procedure, due to the idea that the honest defeaters are easier to find.
Right, of course, that makes more sense. However, I’m still feeling dense—I still have no inkling of how you would argue weak factored cognition from #1 and #2. Indeed, Weak FC seems far too strong to be established from anything resembling #1 and #2: WFC says that for any question Q with a correct answer A, there exists a tree. In terms of the computational complexity analogy, this is like “all problems are in PSPACE”. Presumably you intended this as something like an operational definition of “correct answer” rather than an assertion that all questions are answerable by verifiable trees? In any case, #1 and #2 don’t seem to imply anything like “for all questions with a correct answer...”—indeed, #2 seems irrelevant, since it is about what arguments players can reliably find, not about what the human can verify.
I’ll just flag that I still don’t know this argument, either, and I’m curious where you’re getting it from / what it is. (I have a vague recollection that this argument might have been explained to me in some other comment thread about debate, but, I haven’t found it yet.) But, you understandably don’t focus on articulating your arguments 1 or 2 in the main body of your comment, instead focusing on other things. I’ll leave this comment as a thread for you to articulate those two arguments further if you feel up to it, and make another comment to reply to the bulk of your comment.
I just read the Factored Cognition sequence since it has now finished, and this post derives WFC as the condition necessary for honesty to be an equilibrium in (a slightly unusual form of) debate, under the assumption of optimal play.
Great, thanks!
The computational complexity analogy version would have to put a polynomial limit on the depth of the tree if you wanted to argue that the problem is in PSPACE. My construction doesn’t do this; there will be questions where the depth of the tree is super-polynomial, but the tree still exists. (These will be the cases in which, even under optimal play by an honest agent, the “length” of a chain of defeaters can be super-polynomially large.) So I don’t think my argument is proving too much.
(The tree could be infinite if you don’t have an assumption that guarantees termination somehow, hence my caveats about termination. WFC should probably ask for the existence of a finite tree.)
For the actual argument, I’ll refer back to my original comment, which provides a procedure to construct the tree. Happy to clarify whichever parts of the argument are confusing.
No, I am in fact asserting that given the two assumptions, all questions are answerable by (potentially super-polynomially large) verifiable trees (again assuming we deal with termination somehow).
I think it differs based on what assumptions you make on the human judge, so there isn’t a canonical version of it. In this case, the assumption on the human judge is that if the subanswers they are given are true, then they never verify an incorrect overall answer. (This is different from the “defeaters” assumption you have, for which I’d refer to the argument I gave above.)
Claim: In a turn-by-turn unlimited-length debate, if the first player is honest, then the first player always wins in equilibrium.
Argument: By WFC, we assume there is a finite tree T that can be verified. The first player then has the following strategy: take the question under consideration (initially the original question; later it is whatever subquestion the opponent is disputing). Report “the answer is A, which holds because the answer to subquestion 1 is A1 and the answer to subquestion 2 is A2”.
The opponent will always have to recurse into one of the subclaims (or concede). This brings us one step closer to leaf nodes. Eventually (if the opponent never concedes), we get to a leaf node which the judge then verifies in favor of the honest first player. ∎
Corollary: For the first player, honesty is an equilibrium policy.
Argument: By the claim above, the first player can never do any better than honesty (you can’t do better than always winning).
In a simultaneous-play unlimited-length debate, a similar argument implies at least a 50-50 chance of winning via honesty, which must be the minimax value (since the game is symmetric and zero-sum), and therefore honesty is an equilibrium policy.
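A minimal sketch of the strategy in the claim above (my own encoding; the Node structure and the opponent_pick function are invented for illustration): the honest first player reads answers off the verifiable tree, the opponent must dispute some subclaim or concede, and the recursion bottoms out at a leaf the judge checks directly.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Node:
    question: str
    answer: str
    children: List["Node"] = field(default_factory=list)  # empty => judge-verifiable leaf

def play(node: Node, opponent_pick: Callable[[List[Node]], Node]) -> str:
    if not node.children:
        return "honest wins"                 # the judge verifies the leaf directly
    disputed = opponent_pick(node.children)  # the opponent recurses into some subclaim (or concedes)
    return play(disputed, opponent_pick)

tree = Node("What is 2+2?", "4", [
    Node("What is 2-1?", "1"),
    Node("What is 2+1?", "3"),
    Node("What is 1+3?", "4"),
])
print(play(tree, opponent_pick=lambda kids: kids[-1]))  # honest wins, whichever subclaim is disputed
```

Whichever subclaim the opponent picks, the descent ends at a verifiable leaf, which is all the claim needs.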
----
Once you go to finite-length debates, then things get murkier and you have to worry about arguments that are too long to get to leaf nodes (this is essentially the computationally bounded version of the termination problem). The version of WFC that would be needed is “for every question Q, there is a verifiable tree T of depth at most N showing that the answer is A”; that version of WFC is presumably not true.
OK, but this just makes me regret pointing to the computational complexity analogy. You’re still purporting to prove “for any question with a correct answer, there exists a tree” from assumptions which don’t seem strong enough to say much about all correct answers.
Looking back again, it still seems like what you are trying to do in your original argument is something like point out that optimal play (within my system) can be understood via a tree structure. But this should only establish something like “any question which my version of debate can answer has a tree”, not “any question with a correct answer has a tree”. There is no reason to think that optimal play can correctly answer all questions which have a correct answer.
It seems like what you are doing in your argument is essentially conflating “answer” with “argument”. Just because A is the correct answer to Q does not mean there are any convincing arguments for it.
For generic question Q and correct answer A, I make no assumption that there are convincing arguments for A one way or the other (honest or dishonest). If player 1 simply states A, player 2 would be totally within rights to say “player 1 offers no argument for its position” and receive points for that, as far as I am concerned.
Thus, when you say:
I would say: no, B may be a perfectly valid response to A, with no defeaters, even if A is true and correctly answers Q.
Another problem with your argument—WFC says that all leaf nodes are human-verifiable, whereas some leaf nodes in your suggested tree have to be taken on faith (a fact which you mention, but don’t address).
The “in equilibrium” there must be unnecessary, right? If the first player always wins in equilibrium but might not otherwise, then the second player has a clear incentive to make sure things are not in equilibrium (which is a contradiction).
I buy the argument given some assumptions. I note that this doesn’t really apply to my setting, IE, we have to do more than merely change the scoring to be more like the usual debate scoring.
In particular, this line doesn’t seem true without a further assumption:
Had I considered this argument in the context of my original post, I would have rejected it on the grounds that the opponent can object by other means. For example,
User: What is 2+2?
Player 1: 2+2 is 4. I break down the problem into ‘what is 2-1’ (call it x), ‘what is 2+1’ (call it y), and ‘what is x+y’. I claim x=1, y=3, and x+y=4. Clearly, if all three of these are true, then 2+2=4, since I’ve only added 1 and subtracted 1, so x+y must equal 2+2.
Player 2: 2+2 is 5, though. This is because 2+3 is 6, and 3 is 1 more than 2, so, 2+2 must be 1 less than 6. But 5 is 1 less than 6.
Player 1: If my argument is wrong, which of my assumptions is wrong?
Player 2: I don’t know. Perhaps you have a huge argument tree which I would have to spend a long time examining. I can tell something is wrong, however, thanks to my argument. If you think it should always be possible to point out which specific assumption is incorrect, which of my assumptions do you think is incorrect?
Clearly, if Player 2 is allowed to object by other means like this, Player 2 would greatly prefer to—Player 2 wants to avoid descending Player 1’s argument tree if at all possible.
If successful, Player 2 gets Player 1 to descend Player 2’s infinite tree (which continues to decompose the problem via the same strategy as above), thus never finding the contradiction.
Player 1 can of course ask Player 2 how long the argument tree will be, which does put Player 2 at risk of contradiction in the infinite debate setting. But if debates are finite (but unknown length), Player 2 can claim a large size that makes the contradiction difficult to uncover. Or, Player 2 could avoid answering the question (which seems possible if the players are free to choose which parts of the argument to prioritize in giving their responses).
So I buy your argument under the further assumption that the argument must recurse on Player 1’s claims (rather than allowing Player 2 to make an alternative argument which might get recursed on instead). Or, in a true infinite-debate setting, provided that there’s also a way to force opponents to answer questions (EG the judge assumes you’re lying if you repeatedly dodge a question).
I think at this point I want a clearer theoretical model of what assumptions you are and aren’t making. Like, at this point, I’m feeling more like “why are we even talking about defeaters; there are much bigger issues in this setup”.
I wouldn’t be surprised at this point if most of the claims I’ve made are actually false under the assumptions you seem to be working under.
Not sure what you want me to “address”. The leaf nodes that are taken on faith really are true under optimal play, which is what happens at equilibrium.
This is why I prefer the version of debate outlined here, where both sides make a claim and then each side must recurse down on the other’s arguments. I didn’t realize you were considering a version where you don’t have to specifically rebut the other player’s arguments.
I just meant to include the fact that the honest player is able to find the defeaters to dishonest arguments. If you include that in “the honest policy”, then I agree that “in equilibrium” is unnecessary. (I definitely could have phrased that better.)
To focus on this part, because it seems quite tractable --
Let’s grant for the sake of argument that these nodes are true under optimal play. How can the human verify that? Optimal play is quite a computationally complex object.
WFC as you stated it says that these leaf nodes are verifiable:
So the tree you provide doesn’t satisfy this condition. Yet you say:
To me this reads like “this would satisfy WFC if WFC allowed humans to take leaf nodes on faith, rather than verify them”.
Am I still misunderstanding something big about the kind of argument you are trying to make?
I don’t think so, but to formalize the argument a bit more, let’s define this new version of the WFC:
Special-Tree WFC: For any question Q with correct answer A, there exists a tree of decompositions T arguing this such that:
Every internal node has exactly one child leaf of the form “What is the best defeater to X?” whose answer is auto-verified,
For every other leaf node, a human can verify that the answer to the question at that node is correct,
For every internal node, a human can verify that the answer to the question is correct, assuming that the subanswers are correct.
(As before, we assume that the human never verifies something incorrect, unless the subanswers they were given were incorrect.)
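For concreteness, the three conditions above can be written as a check over explicit trees; this is a sketch under my own representation, and human_verifies / is_defeater_leaf are stand-ins for the judge and the auto-verified subquestion rather than anything defined in the thread:

```python
def satisfies_special_wfc(node, human_verifies, is_defeater_leaf) -> bool:
    """node is a dict {'q': question, 'a': answer, 'children': [...]}."""
    kids = node["children"]
    if not kids:                                        # ordinary leaf: the human checks it directly
        return human_verifies(node, subanswers=[])
    defeater_kids = [c for c in kids if is_defeater_leaf(c)]
    if len(defeater_kids) != 1 or defeater_kids[0]["children"]:
        return False                                    # need exactly one auto-verified defeater leaf
    if not human_verifies(node, subanswers=[c["a"] for c in kids]):
        return False                                    # the human verifies this node given the subanswers
    return all(satisfies_special_wfc(c, human_verifies, is_defeater_leaf)
               for c in kids if not is_defeater_leaf(c))
```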
Claim 1: (What I thought was) your assumption ⇒ Special-Tree WFC, using the construction I gave.
Claim 2: Special-Tree WFC + assumption of optimal play ⇒ honesty is an equilibrium, using the same argument that applies to regular WFC + assumption of optimal play.
Idk whether this is still true under the assumptions you’re using; I think claim 1 in particular is probably not true under your model.
Ah, OK, so you were essentially assuming that humans had access to an oracle which could verify optimal play.
This sort of makes sense, as a human with access to a debate system in equilibrium does have such an oracle. I still don’t yet buy your whole argument, for reasons being discussed in another branch of our conversation, but this part makes enough sense.
Your argument also has some leaf nodes which use the terminology “fully defeat”, in contrast to “defeat”. I assume this means that in the final analysis (after expanding the chain of defeaters) this refutation was a true one, not something ultimately refuted.
If so, it seems you also need an oracle for that, right? Unless you think that can be inferred from some fact about optimal play. EG, that a player bothered to say it rather than concede.
In any case it seems like you could just make the tree out of the claim “A is never fully defeated”:
Node(Q, A, [Leaf("Is A ever fully defeated?", "No")])
I don’t think I ever use “fully defeat” in a leaf? It’s always in a Node, or in a Tree (which is a recursive call to the procedure that creates the tree).
Yes, that’s what I mean by “fully defeat”.
Ahhhhh, OK. I missed that that was supposed to be a recursive call, and interpreted it as a leaf node based on the overall structure. So I was still missing an important part of your argument. I thought you were trying to offer a static tree in that last part, rather than a procedure.
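To illustrate the static-tree-versus-procedure distinction, here is a hypothetical sketch of a mutually recursive procedure with the shape being described (my guess at the shape only, not the construction from the original comment; the DEFEATERS table is invented so the sketch runs, and termination is simply assumed):

```python
DEFEATERS = {"A": "B", "B": "C", "C": None}   # invented: best known defeater of each argument

def Leaf(q, a):            return {"q": q, "a": a, "children": []}
def Node(q, a, children):  return {"q": q, "a": a, "children": children}

def never_fully_defeated(x):
    b = DEFEATERS[x]                           # the auto-verified "best defeater" subquestion
    if b is None:
        return Node(f"Is {x} ever fully defeated?", "No",
                    [Leaf(f"What is the best defeater to {x}?", "There is none")])
    return Node(f"Is {x} ever fully defeated?", "No",
                [Leaf(f"What is the best defeater to {x}?", b),
                 fully_defeated(b)])           # recursive call, not a leaf

def fully_defeated(x):
    c = DEFEATERS[x]                           # a defeater exists here, by assumption
    return Node(f"Is {x} fully defeated?", "Yes",
                [Leaf(f"What is the best defeater to {x}?", c),
                 never_fully_defeated(c)])

tree = never_fully_defeated("A")  # A stands: B defeats A, but C defeats B and is itself undefeated
```

The point of the sketch is just that “fully defeated” shows up in Nodes and in recursive calls, never as a bare leaf.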
An understandable response. Of course I could try to be more clear about my assumptions (and might do so).
But it seems to me that the current misunderstandings are mostly about how I was jumping off from the original debate paper (in which responses are a back-and-forth sequence, and players answer in unstructured text, with no rules except those the judge may enforce) whereas you were using more recent proposals as your jumping-off-point.
Moreover, rather than trying to go over the basic assumptions, I think we can make progress (at least on my side) by focusing narrowly on how your argument is supposed to go through for an example.
So, I propose as a concrete counterexample to your argument:
Q: What did Plato have for lunch two days before he met Socrates? (Suppose for the sake of argument that these two men existed, and met.) A: Fish. (Suppose for the sake of argument that this is factually true, but cannot be known to us by any argument.)
I propose that the tree you provided via your argument cannot be a valid tree-computation of what Plato had for lunch that day, because assertions about which player conceded, what statements have defeaters, etc. have little bearing on the question of what Plato had for lunch (because we simply don’t have enough information to establish this by any argument, no matter how large, and neither do the players). This seems to me like a big problem with your approach, not a finicky issue due to some misunderstanding of my assumptions about debate.
Surely it’s clear that, in general, not all correct answers have convincing arguments supporting them?
Again, this is why I was quick to assume that by “correct answer” you surely meant something weaker, eg an operational definition. Yet you insist that you mean the strong thing.
Not to get caught up arguing whether WFC is true (I’m saying it’s really clearly false as stated, but that’s not my focus—after all, whether WFC is true or false has no bearing on the question of whether my assumption implies it). Rather, I’d prefer to focus on the question of how your proposed tree would deal with that case.
According to you, what would the tree produced via your argument look like, and how would it be a valid tree-computation of what Plato had for lunch?
Generally speaking, I didn’t have the impression that these more complex setups had significantly different properties with respect to my primary concerns. This could be wrong. But in particular, I don’t see that that setup forces specific rebuttal, either:
(Emphasis added.) So it seems to me like a dishonest player still can, in this system, focus on building up their own argument rather than pointing out where they think their opponent went wrong. Or, even if they do object, they can simply choose to recurse on the honest player’s objections instead (so that they get to explore their own infinite argument tree, rather than the honest, bounded tree of their opponent).
Ah, I see what you mean now. Yeah, I agree that debate is not going to answer fish in the scenario above. Sorry for using “correct” in a confusing way.
When I say that you get the correct answer, or the honest answer, I mean something like “you get the one that we would want our AI systems to give, if we knew everything that the AI systems know”. An alternative definition is that the answer should be “accurately reporting what humans would justifiably believe given lots of time to reflect” rather than “accurately corresponding to reality”.
(The two definitions above come apart when you talk about questions that the AI system knows about but can’t justify to humans, e.g. “how do you experience the color red”, but I’m ignoring those questions for now.)
(I’d prefer to talk about “accurately reporting the AI’s beliefs”, but there’s no easy way to define what beliefs an AI system has, and also in any case debate .)
In the example you give, the AI systems also couldn’t reasonably believe that the answer is “fish”, and so the “correct” / “honest” answer in this case is “the question can’t be answered given our current information”, or “the best we can do is guess the typical food for an ancient Greek diet”, or something along those lines. If the opponent tried to dispute this, then you simply challenge them to do better; they will then fail to do so. Given the assumption of optimal play, this absence of evidence is evidence of absence, and you can conclude that the answer is correct.
In this case they’re acknowledging that the other player’s argument is “correct” (i.e. more likely than not to win if we continued recursively debating). While this doesn’t guarantee their loss, it sure seems like a bad sign.
Yes, I agree this is true under those specific rules. But if there was a systematic bias in this way, you could just force exploration of both player’s arguments in parallel (at only 2x the cost).
Right, OK.
So my issue with using “correct” like this in the current context is that it hides too much and creates a big risk of conflation. By no means do I assume—or intend to argue—that my debate setup can correctly answer every question in the sense above. Yet, of course, I intend for my system to provide “correct answers” in some sense. (A sense which has less to do with providing the best answer possible from the information available, and more to do with avoiding mistakes.)
If I suppose “correct” is close to “has an honest argument which gives enough information to convince a human” (let’s call this correct_abram), then I would buy your original argument. Yet this would do little to connect my argument to factored cognition.
If I suppose “correct” is close to what HCH would say (correct_paul) then I still don’t buy your argument at all, for precisely the same reason that I don’t buy the version where “correct” simply means “true”—namely, because correct_paul answers don’t necessarily win in my debate setup, any more than correct_true answers do.
Of course neither of those would be very sensible definitions of “correct”, since either would make the WFC claim uninteresting.
Let’s suppose that “correct” at least includes answers which an ideal HCH would give (IE, assuming no alignment issues with HCH, and assuming the human uses pretty good question-answering strategies). I hope you think that’s a fair supposition—your original comment was trying to make a meaningful statement about the relationship between my thing and factored cognition, so it seems reasonable to interpret WFC in that light.
I furthermore suppose that actual literal PSPACE problems can be safely computed by HCH. (This isn’t really clear, given safety restrictions you’d want to place on HCH, but we can think about that more if you want to object.)
So my new counterexample is PSPACE problems. Although I suppose an HCH can answer such questions, I have no reason to think my proposed debate system can. Therefore I think the tree you propose (which iiuc amounts to a proof of “A is never fully defeated”) won’t systematically be correct (A may be defeated by virtue of its advocate not being able to provide the human with enough reason to think it is true).
---
Other responses:
In this position, I would argue to the judge that not being able to identify specifically which assumption of my opponent’s is incorrect does not indicate concession, precisely because my opponent may have a complex web of argumentation which hides the contradiction deep in the branches or pushes it off to infinity.
Agreed—I was only pointing out that the setup you linked didn’t have the property you mentioned, not that it would be particularly hard to get.
Re: correctness, I think I actually misled you with my last comment; I lost track of the original point. I endorse the thing I said as a definition of what I’m usually hoping for with debate, but I don’t think that was the definition I was using here.
I think in this comment thread I’ve been defining an honest answer as one that can be justified via arguments that eventually don’t have any defeaters. I thought this was what you were going for since you started with the assumption that dishonest answers always have defeaters—while this doesn’t strictly imply my definition, that just seemed like the obvious theoretical model to be using. (I didn’t consciously realize I was making that assumption.)
I still think that working with this “definition” is an interesting theoretical exercise, though I agree it doesn’t correspond to reality. Looking back I can see that you were talking about how this “definition” doesn’t actually correspond to the realistic situation, but I didn’t realize that’s what you were saying, sorry about that.
Right, I agree—I was more or less taking that as a definition of honesty. However, this doesn’t mean we’d want to take it as a working definition of correctness, particularly not for WFC.
It sounds like you are saying you intended the first case I mentioned in my previous argument, IE:
Do you agree with my conclusion that your argument would, then, have little to do with factored cognition? (If so, I want to edit my first reply to you to summarize the eventual conclusion of this and other parts of the discussion, to make it easier on future readers—so I’m asking if you agree with that summary.)
To elaborate: the “correct_abram version” of WFC says, essentially, that NP-like problems (more specifically: informal questions whose answers have supporting arguments which humans can verify, though humans may also incorrectly verify wrong answers/arguments) have computation trees which humans can inductively verify.
This is at best a highly weakened version of factored cognition, and generally deals with a slightly different issue (ie, tries to deal with the problem of humans mistakenly verifying incorrect arguments).
I think you are taking this somewhat differently than I am. The fact that correct_abram doesn’t serve as a plausible notion of “correctness” (in your sense) and that honest_abram doesn’t serve as a plausible notion of “honesty” (in the sense of getting the AI system to reveal all information it has) isn’t especially a crux for the applicability of my analysis, imho. My crux is, rather, the “no indescribably bad argument” thesis.
If bad arguments are always describably bad, then it’s plausible that some debate method could systematically avoid manipulation and perform well, even if stronger factored-cognition type theses failed. Which is the main point here.
I think you also need that at least some of the time good arguments are not describably bad (i.e. they don’t have defeaters); otherwise there is no way to distinguish between good and bad arguments. (Or you need to posit some external-to-debate method of giving the AI system information about good vs bad arguments.)
I think I’m still a bit confused on the relation of Factored Cognition to this comment thread, but I do agree at least that the main points we were discussing are not particularly related to Factored Cognition. (In particular, the argument that zero-sum is fine can be made without any reference to Factored Cognition.) So I think that summary seems fine.
While I agree that there is a significant problem, I’m not confident I’d want to make that assumption.
As I mentioned in the other branch, I was thinking of differences in how easy lies are to find, rather than existence. It seems natural to me to assume that every individual thing does have a convincing counterargument, if we look through the space of all possible strings (not because I’m sure this is true, but because it’s the conservative assumption—I have no strong reason to think humans aren’t that hackable, even if we are less vulnerable to adversarial examples in some sense).
So my interpretation of “finding the honest equilibrium” in debate was: you enter a regime where the (honest) debate strategies are too powerful, such that small mutations toward lying are defeated because they’re not lying well.
All of this was an implicit model, not a carefully thought out position on my part. Thus, I was saying things like “50% probability the opponent finds a plausible lie” which don’t make sense as an equilibrium analysis—in true equilibrium, players would know all the plausible lies, and know their opponents knew them, etc.
But, this kind of uncertainty still makes sense for any realistic level of training.
Furthermore, one might hope that the rational-player perspective (in which the risks and rewards of lying are balanced in order to determine whether to lie) simply doesn’t apply, because in order to suddenly start lying well, a player would have to invent the whole art of lying in one gradient descent step. So, if one is sufficiently stuck in an honesty “basin”, one cannot jump over the sides, even if there are perfectly good plays which involve doing so. I offer this as the steelman of the implicit position I had.
Overall, making this argument more explicit somewhat reduces my credence in debate, because:
I was not explicitly recognizing that talk of “honest equilibrium” relies on assumptions about misleading counterarguments not existing, as opposed to weaker assumptions about them being hard to find (I think this also applies to regular debate, not just my framework here)
Steelmanning “dishonest arguments are harder to make” as an argument about training procedures, rather than about equilibrium, seems to rest on assumptions which would be difficult to gain confidence in.
-2/+1 Scoring
It’s worth explicitly noting that this weakens my argument for the −2/+1 scoring.
I was arguing that although −2/+1 can seriously disadvantage honest strategies in some cases (as you mention, it could mean the first player can lie, and the second player keeps silent to avoid retribution), it fixes a problem within the would-be honest attractor basin. Namely, I argued that it cut off otherwise problematic cases where dishonest players can force a tie (in expectation) by continuing to argue forever.
Now, the assumptions under which this is a problem are somewhat complex (as we’ve discussed). But for this to be a problem, I must assume there is a seeming counterargument to almost anything (at least, enough that the dishonest player can steer toward conversational territory in which this is true). Which means we can’t be making an argument about the equilibrium being good. Therefore, if this concern is relevant for us, we must be arguing about training rather than equilibrium behavior. (In the sense I discussed above.)
But if we’re arguing about training, we hopefully still have some assumption about lies being harder to find (during training). So, there should already be some other way to argue that you can’t go on dishonestly arguing forever.
So the situation would have to be pretty weird for −2/+1 to be useful.
(I don’t by any means intend to say that “a dishonest player continuing to argue in order to get a shot at not losing” isn’t a problem—just that if it’s a problem, it’s probably not a problem −2/+1 scoring can help with.)
Yeah all of this makes sense to me; I agree that you could make an argument about the difference in difficulty of finding defeaters to good vs. bad arguments, and that could then be used to say “debate will in practice lead to honest policies”.