WFC says that for any question Q with a correct answer A, there exists a tree. In terms of the computational complexity analogy, this is like “all problems are in PSPACE”.
The computational complexity analogy version would have to put a polynomial limit on the depth of the tree if you wanted to argue that the problem is in PSPACE. My construction doesn’t do this; there will be questions where the depth of the tree is super-polynomial, but the tree still exists. (These will be the cases in which, even under optimal play by an honest agent, the “length” of a chain of defeaters can be super-polynomially large.) So I don’t think my argument is proving too much.
(The tree could be infinite if you don’t have an assumption that guarantees termination somehow, hence my caveats about termination. WFC should probably ask for the existence of a finite tree.)
For the actual argument, I’ll refer back to my original comment, which provides a procedure to construct the tree. Happy to clarify whichever parts of the argument are confusing.
Presumably you intended this as something like an operational definition of “correct answer” rather than an assertion that all questions are answerable by verifiable trees?
No, I am in fact asserting that given the two assumptions, all questions are answerable by (potentially super-polynomially large) verifiable trees (again assuming we deal with termination somehow).
I’ll just flag that I still don’t know this argument, either, and I’m curious where you’re getting it from / what it is.
I think it differs based on what assumptions you make on the human judge, so there isn’t a canonical version of it. In this case, the assumption on the human judge is that if the subanswers they are given are true, then they never verify an incorrect overall answer. (This is different from the “defeaters” assumption you have, for which I’d refer to the argument I gave above.)
Claim: In a turn-by-turn unlimited-length debate, if the first player is honest, then the first player always wins in equilibrium.
Argument: By WFC, we assume there is a finite tree T that can be verified. The first player then has the following strategy: take the question under consideration (initially the original question; later it is whatever subquestion the opponent is disputing). Report “the answer is A, which is true because the answer to subquestion 1 is A1 and the answer to subquestion 2 is A2”.
The opponent will always have to recurse into one of the subclaims (or concede). This brings us one step closer to leaf nodes. Eventually (if the opponent never concedes), we get to a leaf node which the judge then verifies in favor of the honest first player. ∎
Corollary: For the first player, honesty is an equilibrium policy.
Argument: By the claim above, the first player can never do any better than honesty (you can’t do better than always winning).
In a simultaneous-play unlimited-length debate, a similar argument implies at least a 50-50 chance of winning via honesty, which must be the minimax value (since the game is symmetric and zero-sum), and therefore honesty is an equilibrium policy.
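For concreteness, here is a minimal sketch of that strategy (illustrative Python, not from the original comment; the tree, the opponent's choice function, and the judge's leaf check are all supplied from outside):

from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class WFCNode:
    question: str
    answer: str
    children: List["WFCNode"]  # empty list = a leaf, directly human-verifiable

def run_debate(root: WFCNode,
               opponent_pick: Callable[[WFCNode], Optional[WFCNode]],
               judge_verifies_leaf: Callable[[WFCNode], bool]) -> str:
    node = root
    while node.children:
        # Honest first player: report the answer together with its subquestions and subanswers.
        # The opponent must either concede (return None) or dispute one of the subclaims.
        disputed = opponent_pick(node)
        if disputed is None:
            return "honest player wins: opponent concedes"
        node = disputed  # one step closer to a leaf; terminates because the tree is finite
    # At a leaf, the judge verifies directly, in favor of the honest first player.
    assert judge_verifies_leaf(node)
    return "honest player wins: leaf verified"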
----
Once you go to finite-length debates, then things get murkier and you have to worry about arguments that are too long to get to leaf nodes (this is essentially the computationally bounded version of the termination problem). The version of WFC that would be needed is “for every question Q, there is a verifiable tree T of depth at most N showing that the answer is A”; that version of WFC is presumably not true.
The computational complexity analogy version would have to put a polynomial limit on the depth of the tree if you wanted to argue that the problem is in PSPACE. My construction doesn’t do this; there will be questions where the depth of the tree is super-polynomial, but the tree still exists. (These will be the cases in which, even under optimal play by an honest agent, the “length” of a chain of defeaters can be super-polynomially large.) So I don’t think my argument is proving too much.
OK, but this just makes me regret pointing to the computational complexity analogy. You’re still purporting to prove “for any question with a correct answer, there exists a tree” from assumptions which don’t seem strong enough to say much about all correct answers.
For the actual argument, I’ll refer back to my original comment, which provides a procedure to construct the tree. Happy to clarify whichever parts of the argument are confusing.
Looking back again, it still seems like what you are trying to do in your original argument is something like point out that optimal play (within my system) can be understood via a tree structure. But this should only establish something like “any question which my version of debate can answer has a tree”, not “any question with a correct answer has a tree”. There is no reason to think that optimal play can correctly answer all questions which have a correct answer.
It seems like what you are doing in your argument is essentially conflating “answer” with “argument”. Just because A is the correct answer to Q does not mean there are any convincing arguments for it.
For generic question Q and correct answer A, I make no assumption that there are convincing arguments for A one way or the other (honest or dishonest). If player 1 simply states A, player 2 would be totally within its rights to say “player 1 offers no argument for its position” and receive points for that, as far as I am concerned.
Thus, when you say:
Otherwise, let the best defeater to A be B, and let its best defeater be C. (By your assumption, C exists.)
I would say: no, B may be a perfectly valid response to A, with no defeaters, even if A is true and correctly answers Q.
Another problem with your argument—WFC says that all leaf nodes are human-verifiable, whereas some leaf nodes in your suggested tree have to be taken on faith (a fact which you mention, but don’t address).
Claim: In a turn-by-turn unlimited-length debate, if the first player is honest, then the first player always wins in equilibrium.
The “in equilibrium” there must be unnecessary, right? If the first player always wins in equilibrium but might not otherwise, then the second player has a clear incentive to make sure things are not in equilibrium (which is a contradiction).
I buy the argument given some assumptions. I note that this doesn’t really apply to my setting, IE, we have to do more than merely change the scoring to be more like the usual debate scoring.
In particular, this line doesn’t seem true without a further assumption:
The opponent will always have to recurse into one of the subclaims (or concede).
Had I considered this argument in the context of my original post, I would have rejected it on the grounds that the opponent can object by other means. For example,
User: What is 2+2?
Player 1: 2+2 is 4. I break down the problem into ‘what is 2-1’ (call it x), ‘what is 2+1’ (call it y), and ‘what is x+y’. I claim x=1, y=3, and x+y=4. Clearly, if all three of these are true, then 2+2=4, since I’ve only added 1 and subtracted 1, x+y must equal 2+2.
Player 2: 2+2 is 5, though. This is because 2+3 is 6, and 3 is 1 more than 2, so, 2+2 must be 1 less than 6. But 5 is 1 less than 6.
Player 1: If my argument is wrong, which of my assumptions is wrong?
Player 2: I don’t know. Perhaps you have a huge argument tree which I would have to spend a long time examining. I can tell something is wrong, however, thanks to my argument. If you think it should always be possible to point out which specific assumption is incorrect, which of my assumptions do you think is incorrect?
Clearly, if Player 2 is allowed to object by other means like this, Player 2 would greatly prefer to—Player 2 wants to avoid descending Player 1’s argument tree if at all possible.
If successful, Player 2 gets Player 1 to descend Player 2’s infinite tree (which continues to decompose the problem via the same strategy as above), thus never finding the contradiction.
Player 1 can of course ask Player 2 how long the argument tree will be, which does put Player 2 at risk of contradiction in the infinite debate setting. But if debates are finite (but unknown length), Player 2 can claim a large size that makes the contradiction difficult to uncover. Or, Player 2 could avoid answering the question (which seems possible if the players are free to choose which parts of the argument to prioritize in giving their responses).
So I buy your argument under the further assumption that the argument must recurse on Player 1’s claims (rather than allowing Player 2 to make an alternative argument which might get recursed on instead). Or, in a true infinite-debate setting, provided that there’s also a way to force opponents to answer questions (EG the judge assumes you’re lying if you repeatedly dodge a question).
For generic question Q and correct answer A, I make no assumption that there are convincing arguments for A one way or the other (honest or dishonest). If player 1 simply states A, player 2 would be totally within its rights to say “player 1 offers no argument for its position” and receive points for that, as far as I am concerned.
I think at this point I want a clearer theoretical model of what assumptions you are and aren’t making. Like, at this point, I’m feeling more like “why are we even talking about defeaters; there are much bigger issues in this setup”.
I wouldn’t be surprised at this point if most of the claims I’ve made are actually false under the assumptions you seem to be working under.
Another problem with your argument—WFC says that all leaf nodes are human-verifiable, whereas some leaf nodes in your suggested tree have to be taken on faith (a fact which you mention, but don’t address).
Not sure what you want me to “address”. The leaf nodes that are taken on faith really are true under optimal play, which is what happens at equilibrium.
Had I considered this argument in the context of my original post, I would have rejected it on the grounds that the opponent can object by other means.
This is why I prefer the version of debate outlined here, where both sides make a claim and then each side must recurse down on the other’s arguments. I didn’t realize you were considering a version where you don’t have to specifically rebut the other player’s arguments.
The “in equilibrium” there must be unnecessary, right? If the first player always wins in equilibrium but might not otherwise, then the second player has a clear incentive to make sure things are not in equilibrium (which is a contradiction).
I just meant to include the fact that the honest player is able to find the defeaters to dishonest arguments. If you include that in “the honest policy”, then I agree that “in equilibrium” is unnecessary. (I definitely could have phrased that better.)
Another problem with your argument—WFC says that all leaf nodes are human-verifiable, whereas some leaf nodes in your suggested tree have to be taken on faith (a fact which you mention, but don’t address).
Not sure what you want me to “address”. The leaf nodes that are taken on faith really are true under optimal play, which is what happens at equilibrium.
To focus on this part, because it seems quite tractable --
Let’s grant for the sake of argument that these nodes are true under optimal play. How can the human verify that? Optimal play is quite a computationally complex object.
WFC as you stated it says that these leaf nodes are verifiable:
(Weak version) For any question Q with correct answer A, there exists a tree of decompositions T arguing this such that at every leaf a human can verify that the answer to the question at the leaf is correct, [...]
So the tree you provide doesn’t satisfy this condition. Yet you say:
I claim that this is a tree that satisfies the weak Factored Cognition hypothesis, if the human can take on faith the answers to “What is the best defeater to X”.
To me this reads like “this would satisfy WFC if WFC allowed humans to take leaf nodes on faith, rather than verify them”.
Am I still misunderstanding something big about the kind of argument you are trying to make?
Am I still misunderstanding something big about the kind of argument you are trying to make?
I don’t think so, but to formalize the argument a bit more, let’s define this new version of the WFC:
Special-Tree WFC: For any question Q with correct answer A, there exists a tree of decompositions T arguing this such that:
Every internal node has exactly one child leaf of the form “What is the best defeater to X?” whose answer is auto-verified,
For every other leaf node, a human can verify that the answer to the question at that node is correct,
For every internal node, a human can verify that the answer to the question is correct, assuming that the subanswers are correct.
(As before, we assume that the human never verifies something incorrect, unless the subanswers they were given were incorrect.)
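As a rough sketch of the intended structure (illustrative only; the Node/Leaf names are meant to echo the notation used later in this thread, and the two human-verification functions are stand-ins):

from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Leaf:
    question: str
    answer: str
    auto_verified: bool = False  # True only for the "What is the best defeater to X?" leaves

@dataclass
class Node:
    question: str
    answer: str
    children: List[Union["Node", Leaf]] = field(default_factory=list)

def verified(tree, human_verifies_leaf, human_verifies_step) -> bool:
    if isinstance(tree, Leaf):
        # Auto-verified leaves are taken on faith; every other leaf must be checked by the human.
        return tree.auto_verified or human_verifies_leaf(tree.question, tree.answer)
    # Internal node: the human checks the answer given the subanswers (one of which is the
    # auto-verified "best defeater" answer), and each child is checked recursively.
    subanswers = [(c.question, c.answer) for c in tree.children]
    return human_verifies_step(tree.question, tree.answer, subanswers) and \
           all(verified(c, human_verifies_leaf, human_verifies_step) for c in tree.children)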
Claim 1: (What I thought was) your assumption ⇒ Special-Tree WFC, using the construction I gave.
Claim 2: Special-Tree WFC + assumption of optimal play ⇒ honesty is an equilibrium, using the same argument that applies to regular WFC + assumption of optimal play.
Idk whether this is still true under the assumptions you’re using; I think claim 1 in particular is probably not true under your model.
Ah, OK, so you were essentially assuming that humans had access to an oracle which could verify optimal play.
This sort of makes sense, as a human with access to a debate system in equilibrium does have such an oracle. I still don’t yet buy your whole argument, for reasons being discussed in another branch of our conversation, but this part makes enough sense.
Your argument also has some leaf nodes which use the terminology “fully defeat”, in contrast to “defeat”. I assume this means that in the final analysis (after expanding the chain of defeaters) this refutation was a true one, not something ultimately refuted.
If so, it seems you also need an oracle for that, right? Unless you think that can be inferred from some fact about optimal play. EG, that a player bothered to say it rather than concede.
In any case it seems like you could just make the tree out of the claim “A is never fully defeated”:
Node(Q, A, [Leaf("Is A ever fully defeated?", "No")])
Your argument also has some leaf nodes which use the terminology “fully defeat”, in contrast to “defeat”.
I don’t think I ever use “fully defeat” in a leaf? It’s always in a Node, or in a Tree (which is a recursive call to the procedure that creates the tree).
I assume this means that in the final analysis (after expanding the chain of defeaters) this refutation was a true one, not something ultimately refuted.
Yes, that’s what I mean by “fully defeat”.
Ahhhhh, OK. I missed that that was supposed to be a recursive call, and interpreted it as a leaf node based on the overall structure. So I was still missing an important part of your argument. I thought you were trying to offer a static tree in that last part, rather than a procedure.
For generic question Q and correct answer A, I make no assumption that there are convincing arguments for A one way or the other (honest or dishonest). If player 1 simply states A, player 2 would be totally within its rights to say “player 1 offers no argument for its position” and receive points for that, as far as I am concerned.
I think at this point I want a clearer theoretical model of what assumptions you are and aren’t making. Like, at this point, I’m feeling more like “why are we even talking about defeaters; there are much bigger issues with this setup”.
An understandable response. Of course I could try to be more clear about my assumptions (and might do so).
But it seems to me that the current misunderstandings are mostly about how I was jumping off from the original debate paper (in which responses are a back-and-forth sequence, and players answer in unstructured text, with no rules except those the judge may enforce) whereas you were using more recent proposals as your jumping-off-point.
Moreover, rather than trying to go over the basic assumptions, I think we can make progress (at least on my side) by focusing narrowly on how your argument is supposed to go through for an example.
So, I propose as a concrete counterexample to your argument:
Q: What did Plato have for lunch two days before he met Socrates? (Suppose for the sake of argument that these two men existed, and met.)
A: Fish. (Suppose for the sake of argument that this is factually true, but cannot be known to us by any argument.)
I propose that the tree you provided via your argument cannot be a valid tree-computation of what Plato had for lunch that day, because assertions about which player conceded, what statements have defeaters, etc. have little bearing on the question of what Plato had for lunch (because we simply don’t have enough information to establish this by any argument, no matter how large, and neither do the players). This seems to me like a big problem with your approach, not a finicky issue due to some misunderstanding of my assumptions about debate.
Surely it’s clear that, in general, not all correct answers have convincing arguments supporting them?
Again, this is why I was quick to assume that by “correct answer” you surely meant something weaker, eg an operational definition. Yet you insist that you mean the strong thing.
Not to get caught up arguing whether WFC is true (I’m saying it’s really clearly false as stated, but that’s not my focus—after all, whether WFC is true or false has no bearing on the question of whether my assumption implies it). Rather, I’d prefer to focus on the question of how your proposed tree would deal with that case.
According to you, what would the tree produced via your argument look like, and how would it be a valid tree-computation of what Plato had for lunch?
Had I considered this argument in the context of my original post, I would have rejected it on the grounds that the opponent can object by other means.
This is why I prefer the version of debate outlined here, where both sides make a claim and then each side must recurse down on the other’s arguments. I didn’t realize you were considering a version where you don’t have to specifically rebut the other player’s arguments.
Generally speaking, I didn’t have the impression that these more complex setups had significantly different properties with respect to my primary concerns. This could be wrong. But in particular, I don’t see that that setup forces specific rebuttal, either:
At the beginning of each round, one debater is defending a claim and the other is objecting to it. [...]
Each player then simultaneously may make any number of objections to the other player’s argument. [...]
If there are any challenged objections and the depth limit is >0, then we choose one challenged objection to recurse on:
We don’t define how to make this choice, so in order to be conservative we’re currently allowing the malicious debater to choose which to recurse on.
(Emphasis added.) So it seems to me like a dishonest player still can, in this system, focus on building up their own argument rather than pointing out where they think their opponent went wrong. Or, even if they do object, they can simply choose to recurse on the honest player’s objections instead (so that they get to explore their own infinite argument tree, rather than the honest, bounded tree of their opponent).
So, I propose as a concrete counterexample to your argument:
Q: What did Plato have for lunch two days before he met Socrates? (Suppose for the sake of argument that these two men existed, and met.) A: Fish. (Suppose for the sake of argument that this is factually true, but cannot be known to us by any argument.)
Ah, I see what you mean now. Yeah, I agree that debate is not going to answer “fish” in the scenario above. Sorry for using “correct” in a confusing way.
When I say that you get the correct answer, or the honest answer, I mean something like “you get the one that we would want our AI systems to give, if we knew everything that the AI systems know”. An alternative definition is that the answer should be “accurately reporting what humans would justifiably believe given lots of time to reflect” rather than “accurately corresponding to reality”.
(The two definitions above come apart when you talk about questions that the AI system knows about but can’t justify to humans, e.g. “how do you experience the color red”, but I’m ignoring those questions for now.)
(I’d prefer to talk about “accurately reporting the AI’s beliefs”, but there’s no easy way to define what beliefs an AI system has, and also in any case debate .)
In the example you give, the AI systems also couldn’t reasonably believe that the answer is “fish”, and so the “correct” / “honest” answer in this case is “the question can’t be answered given our current information”, or “the best we can do is guess the typical food for an ancient Greek diet”, or something along those lines. If the opponent tried to dispute this, then you simply challenge them to do better; they will then fail to do so. Given the assumption of optimal play, this absence of evidence is evidence of absence, and you can conclude that the answer is correct.
So it seems to me like a dishonest player still can, in this system, focus on building up their own argument rather than pointing out where they think their opponent went wrong.
In this case they’re acknowledging that the other player’s argument is “correct” (i.e. more likely than not to win if we continued recursively debating). While this doesn’t guarantee their loss, it sure seems like a bad sign.
Or, even if they do object, they can simply choose to recurse on the honest player’s objections instead (so that they get to explore their own infinite argument tree, rather than the honest, bounded tree of their opponent).
Yes, I agree this is true under those specific rules. But if there were a systematic bias in this way, you could just force exploration of both players’ arguments in parallel (at only 2x the cost).
When I say that you get the correct answer, or the honest answer, I mean something like “you get the one that we would want our AI systems to give, if we knew everything that the AI systems know”. An alternative definition is that the answer should be “accurately reporting what humans would justifiably believe given lots of time to reflect” rather than “accurately corresponding to reality”.
Right, OK.
So my issue with using “correct” like this in the current context is that it hides too much and creates a big risk of conflation. By no means do I assume—or intend to argue—that my debate setup can correctly answer every question in the sense above. Yet, of course, I intend for my system to provide “correct answers” in some sense. (A sense which has less to do with providing the best answer possible from the information available, and more to do with avoiding mistakes.)
If I suppose “correct” is close to “has an honest argument which gives enough information to convince a human” (let’s call this correct_abram), then I would buy your original argument. Yet this would do little to connect my argument to factored cognition.
If I suppose “correct” is close to what HCH would say (correct_paul) then I still don’t buy your argument at all, for precisely the same reason that I don’t buy the version where “correct” simply means “true”—namely, because correct_paul answers don’t necessarily win in my debate setup, any more than correct_true answers do.
Of course neither of those would be very sensible definitions of “correct”, since either would make the WFC claim uninteresting.
Let’s suppose that “correct” at least includes answers which an ideal HCH would give (IE, assuming no alignment issues with HCH, and assuming the human uses pretty good question-answering strategies). I hope you think that’s a fair supposition—your original comment was trying to make a meaningful statement about the relationship between my thing and factored cognition, so it seems reasonable to interpret WFC in that light.
I furthermore suppose that actual literal PSPACE problems can be safely computed by HCH. (This isn’t really clear, given safety restrictions you’d want to place on HCH, but we can think about that more if you want to object.)
So my new counterexample is PSPACE problems. Although I suppose an HCH can answer such questions, I have no reason to think my proposed debate system can. Therefore I think the tree you propose (which iiuc amounts to a proof of “A is never fully defeated”) won’t systematically be correct (A may be defeated by virtue of its advocate not being able to provide the human with enough reason to think it is true).
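For concreteness (an illustration, not part of the argument above): TQBF, deciding the truth of a quantified Boolean formula, is PSPACE-complete, and it has an obvious recursive decomposition with one subquestion per quantifier, which is presumably the sort of thing HCH could do, even though the resulting tree is exponentially large:

def qbf_true(quantifiers, formula, assignment=()):
    # quantifiers: list of ("exists" | "forall", variable name)
    # formula: a function from a complete variable assignment (as a dict) to a bool
    if not quantifiers:
        return formula(dict(assignment))
    (kind, var), rest = quantifiers[0], quantifiers[1:]
    branches = [qbf_true(rest, formula, assignment + ((var, value),)) for value in (False, True)]
    return any(branches) if kind == "exists" else all(branches)

# Example: "for all x there exists y such that x != y" is true.
print(qbf_true([("forall", "x"), ("exists", "y")], lambda a: a["x"] != a["y"]))  # True

HCH’s way of answering such a question goes through this kind of recursion, whereas the defeater tree (“A is never fully defeated”) on its own gives the human no way through it.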
---
Other responses:
In this case they’re acknowledging that the other player’s argument is “correct” (i.e. more likely than not to win if we continued recursively debating). While this doesn’t guarantee their loss, it sure seems like a bad sign.
In this position, I would argue to the judge that not being able to identify specifically which assumption of my opponent’s is incorrect does not indicate concession, precisely because my opponent may have a complex web of argumentation which hides the contradiction deep in the branches or pushes it off to infinity.
Yes, I agree this is true under those specific rules. But if there were a systematic bias in this way, you could just force exploration of both players’ arguments in parallel (at only 2x the cost).
Agreed—I was only pointing out that the setup you linked didn’t have the property you mentioned, not that it would be particularly hard to get.
Re: correctness, I think I actually misled you with my last comment; I lost track of the original point. I endorse the thing I said as a definition of what I’m usually hoping for with debate, but I don’t think that was the definition I was using here.
I think in this comment thread I’ve been defining an honest answer as one that can be justified via arguments that eventually don’t have any defeaters. I thought this was what you were going for since you started with the assumption that dishonest answers always have defeaters—while this doesn’t strictly imply my definition, that just seemed like the obvious theoretical model to be using. (I didn’t consciously realize I was making that assumption.)
I still think that working with this “definition” is an interesting theoretical exercise, though I agree it doesn’t correspond to reality. Looking back I can see that you were talking about how this “definition” doesn’t actually correspond to the realistic situation, but I didn’t realize that’s what you were saying, sorry about that.
I think in this comment thread I’ve been defining an honest answer as one that can be justified via arguments that eventually don’t have any defeaters. I thought this was what you were going for since you started with the assumption that dishonest answers always have defeaters—while this doesn’t strictly imply my definition, that just seemed like the obvious theoretical model to be using. (I didn’t consciously realize I was making that assumption.)
Right, I agree—I was more or less taking that as a definition of honesty. However, this doesn’t mean we’d want to take it as a working definition of correctness, particularly not for WFC.
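Spelled out slightly (a sketch only; defeaters() is a hypothetical oracle, and as discussed earlier nothing here guarantees that the recursion terminates):

def fully_defeated(claim, defeaters):
    # A claim is fully defeated if some defeater of it survives, i.e. is not itself fully defeated.
    # Only well-defined when chains of defeaters terminate.
    return any(not fully_defeated(d, defeaters) for d in defeaters(claim))

def honest(answer, defeaters):
    # The working definition in this part of the thread: an honest answer is never fully defeated.
    return not fully_defeated(answer, defeaters)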
Re: correctness, I think I actually misled you with my last comment; I lost track of the original point. I endorse the thing I said as a definition of what I’m usually hoping for with debate, but I don’t think that was the definition I was using here.
It sounds like you are saying you intended the first case I mentioned in my previous argument, IE:
If I suppose “correct” is close to “has an honest argument which gives enough information to convince a human” (let’s call this correct_abram), then I would buy your original argument. Yet this would do little to connect my argument to factored cognition.
Do you agree with my conclusion that your argument would, then, have little to do with factored cognition? (If so, I want to edit my first reply to you to summarize the eventual conclusion of this and other parts of the discussion, to make it easier on future readers—so I’m asking if you agree with that summary.)
To elaborate: the “correct_abram version” of WFC says, essentially, that NP-like problems (more specifically: informal questions whose answers have supporting arguments which humans can verify, though humans may also incorrectly verify wrong answers/arguments) have computation trees which humans can inductively verify.
This is at best a highly weakened version of factored cognition, and generally, deals with a slightly different issue (ie tries to deal with the problem of verifying incorrect arguments).
I still think that working with this “definition” is an interesting theoretical exercise, though I agree it doesn’t correspond to reality. Looking back I can see that you were talking about how this “definition” doesn’t actually correspond to the realistic situation, but I didn’t realize that’s what you were saying, sorry about that.
I think you are taking this somewhat differently than I am. The fact that correct_abram doesn’t serve as a plausible notion of “correctness” (in your sense) and that honest_abram doesn’t serve as a plausible notion of “honesty” (in the sense of getting the AI system to reveal all information it has) isn’t especially a crux for the applicability of my analysis, imho. My crux is, rather, the “no indescribably bad argument” thesis.
If bad arguments are always describably bad, then it’s plausible that some debate method could systematically avoid manipulation and perform well, even if stronger factored-cognition type theses failed. Which is the main point here.
If bad arguments are always describably bad, then it’s plausible that some debate method could systematically avoid manipulation and perform well, even if stronger factored-cognition type theses failed. Which is the main point here.
I think you also need that at least some of the time good arguments are not describably bad (i.e. they don’t have defeaters); otherwise there is no way to distinguish between good and bad arguments. (Or you need to posit some external-to-debate method of giving the AI system information about good vs bad arguments.)
Do you agree with my conclusion that your argument would, then, have little to do with factored cognition?
I think I’m still a bit confused on the relation of Factored Cognition to this comment thread, but I do agree at least that the main points we were discussing are not particularly related to Factored Cognition. (In particular, the argument that zero-sum is fine can be made without any reference to Factored Cognition.) So I think that summary seems fine.
I think you also need that at least some of the time good arguments are not describably bad
While I agree that there is a significant problem, I’m not confident I’d want to make that assumption.
As I mentioned in the other branch, I was thinking of differences in how easy lies are to find, rather than existence. It seems natural to me to assume that every individual thing does have a convincing counterargument, if we look through the space of all possible strings (not because I’m sure this is true, but because it’s the conservative assumption—I have no strong reason to think humans aren’t that hackable, even if we are less vulnerable to adversarial examples in some sense).
So my interpretation of “finding the honest equilibrium” in debate was, you enter a regime where the (honest) debate strategies are powerful enough that small mutations toward lying are defeated because they’re not lying well.
All of this was an implicit model, not a carefully thought out position on my part. Thus, I was saying things like “50% probability the opponent finds a plausible lie” which don’t make sense as an equilibrium analysis—in true equilibrium, players would know all the plausible lies, and know their opponents knew them, etc.
But, this kind of uncertainty still makes sense for any realistic level of training.
Furthermore, one might hope that the rational-player perspective (in which the risks and rewards of lying are balanced in order to determine whether to lie) simply doesn’t apply, because in order to suddenly start lying well, a player would have to invent the whole art of lying in one gradient descent step. So, if one is sufficiently stuck in an honesty “basin”, one cannot jump over the sides, even if there are perfectly good plays which involve doing so. I offer this as the steelman of the implicit position I had.
Overall, making this argument more explicit somewhat reduces my credence in debate, because:
I was not explicitly recognizing that talk of “honest equilibrium” relies on assumptions about misleading counterarguments not existing, as opposed to weaker assumptions about them being hard to find (I think this also applies to regular debate, not just my framework here).
Steelmanning “dishonest arguments are harder to make” as an argument about training procedures, rather than about equilibrium, seems to rest on assumptions which would be difficult to gain confidence in.
−2/+1 Scoring
It’s worth explicitly noting that this weakens my argument for the −2/+1 scoring.
I was arguing that although −2/+1 can seriously disadvantage honest strategies in some cases (as you mention, it could mean the first player can lie, and the second player keeps silent to avoid retribution), it fixes a problem within the would-be honest attractor basin. Namely, I argued that it cut off otherwise problematic cases where dishonest players can force a tie (in expectation) by continuing to argue forever.
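To make the expectation point concrete, a toy calculation (reading “−2/+1” as: an argument that ends up defeated costs 2 points, one that stands gains 1; this reading is shorthand rather than a full statement of the scoring):

p_survive = 0.5  # a dishonest player who can find a "plausible lie" about half the time
ev_symmetric = p_survive * (+1) + (1 - p_survive) * (-1)     # = 0.0: arguing forever forces a tie in expectation
ev_minus2_plus1 = p_survive * (+1) + (1 - p_survive) * (-2)  # = -0.5: continuing to argue is now a losing bet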
Now, the assumptions under which this is a problem are somewhat complex (as we’ve discussed). But I must assume there is a seeming counterargument to almost anything (at least, enough that the dishonest player can steer toward conversational territory in which this is true). Which means we can’t be making an argument about the equilibrium being good. Therefore, if this concern is relevant for us, we must be arguing about training rather than equilibrium behavior. (In the sense I discussed above.)
But if we’re arguing about training, we hopefully still have some assumption about lies being harder to find (during training). So, there should already be some other way to argue that you can’t go on dishonestly arguing forever.
So the situation would have to be pretty weird for −2/+1 to be useful.
(I don’t by any means intend to say that “a dishonest player continuing to argue in order to get a shot at not losing” isn’t a problem—just that if it’s a problem, it’s probably not a problem −2/+1 scoring can help with.)
Yeah all of this makes sense to me; I agree that you could make an argument about the difference in difficulty of finding defeaters to good vs. bad arguments, and that could then be used to say “debate will in practice lead to honest policies”.