When I say that you get the correct answer, or the honest answer, I mean something like “you get the one that we would want our AI systems to give, if we knew everything that the AI systems know”. An alternative definition is that the answer should be “accurately reporting what humans would justifiably believe given lots of time to reflect” rather than “accurately corresponding to reality”.
Right, OK.
So my issue with using “correct” like this in the current context is that it hides too much and creates a big risk of conflation. By no means do I assume—or intend to argue—that my debate setup can correctly answer every question in the sense above. Yet, of course, I intend for my system to provide “correct answers” in some sense. (A sense which has less to do with providing the best answer possible from the information available, and more to do with avoiding mistakes.)
If I suppose “correct” is close to “has an honest argument which gives enough information to convince a human” (let’s call this correct_abram), then I would buy your original argument. Yet this would do little to connect my argument to factored cognition.
If I suppose “correct” is close to “what HCH would say” (correct_paul), then I still don’t buy your argument at all, for precisely the same reason that I don’t buy the version where “correct” simply means “true”—namely, because correct_paul answers don’t necessarily win in my debate setup, any more than correct_true answers do.
Of course neither of those would be very sensible definitions of “correct”, since either would make the WFC claim uninteresting.
Let’s suppose that “correct” at least includes answers which an ideal HCH would give (IE, assuming no alignment issues with HCH, and assuming the human uses pretty good question-answering strategies). I hope you think that’s a fair supposition—your original comment was trying to make a meaningful statement about the relationship between my thing and factored cognition, so it seems reasonable to interpret WFC in that light.
I furthermore suppose that actual literal PSPACE problems can be safely computed by HCH. (This isn’t really clear, given safety restrictions you’d want to place on HCH, but we can think about that more if you want to object.)
So my new counterexample is PSPACE problems. Although I suppose an HCH can answer such questions, I have no reason to think my proposed debate system can. Therefore I think the tree you propose (which, if I understand correctly, amounts to a proof that “A is never fully defeated”) won’t systematically be correct: A may be defeated simply because its advocate cannot give the human enough reason to think it is true.
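To make the PSPACE point concrete, here is a minimal illustrative sketch (my own toy example, not anything from the discussion above). TQBF—deciding the truth of a fully quantified Boolean formula—is the canonical PSPACE-complete problem, and the recursive evaluator below mirrors the kind of question decomposition an idealized HCH could carry out, with each node asking two subquestions and combining the answers. The recursion tree is exponentially large, which is exactly why an advocate in a bounded debate may be unable to walk the human through enough of it. The encoding and function names are hypothetical choices for this sketch.

```python
# Illustrative sketch: recursive evaluation of a quantified Boolean formula,
# the canonical PSPACE-complete problem. Each call plays the role of an
# HCH-style node: pose the two subquestions (variable = False / True) and
# combine the sub-answers according to the quantifier.
from typing import Callable, Dict, List, Optional, Tuple

Assignment = Dict[str, bool]

def eval_qbf(prefix: List[Tuple[str, str]],
             matrix: Callable[[Assignment], bool],
             assignment: Optional[Assignment] = None) -> bool:
    """prefix: list of ('forall' | 'exists', variable); matrix: quantifier-free body."""
    assignment = dict(assignment or {})
    if not prefix:
        return matrix(assignment)
    quantifier, var = prefix[0]
    branches = [eval_qbf(prefix[1:], matrix, {**assignment, var: value})
                for value in (False, True)]
    return all(branches) if quantifier == 'forall' else any(branches)

# Example: "for all x there exists y such that x differs from y" is true.
print(eval_qbf([('forall', 'x'), ('exists', 'y')],
               lambda a: a['x'] != a['y']))  # True
```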
---
Other responses:
In this case they’re acknowledging that the other player’s argument is “correct” (i.e. more likely than not to win if we continued recursively debating). While this doesn’t guarantee their loss, it sure seems like a bad sign.
In this position, I would argue to the judge that not being able to identify specifically which assumption of my opponent’s is incorrect does not indicate concession, precisely because my opponent may have a complex web of argumentation which hides the contradiction deep in the branches or pushes it off to infinity.
Yes, I agree this is true under those specific rules. But if there was a systematic bias in this way, you could just force exploration of both player’s arguments in parallel (at only 2x the cost).
Agreed—I was only pointing out that the setup you linked didn’t have the property you mentioned, not that it would be particularly hard to get.
Re: correctness, I think I actually misled you with my last comment; I lost track of the original point. I endorse the thing I said as a definition of what I’m usually hoping for with debate, but I don’t think that was the definition I was using here.
I think in this comment thread I’ve been defining an honest answer as one that can be justified via arguments that eventually don’t have any defeaters. I thought this was what you were going for since you started with the assumption that dishonest answers always have defeaters—while this doesn’t strictly imply my definition, that just seemed like the obvious theoretical model to be using. (I didn’t consciously realize I was making that assumption.)
I still think that working with this “definition” is an interesting theoretical exercise, though I agree it doesn’t correspond to reality. Looking back I can see that you were talking about how this “definition” doesn’t actually correspond to the realistic situation, but I didn’t realize that’s what you were saying, sorry about that.
I think in this comment thread I’ve been defining an honest answer as one that can be justified via arguments that eventually don’t have any defeaters. I thought this was what you were going for since you started with the assumption that dishonest answers always have defeaters—while this doesn’t strictly imply my definition, that just seemed like the obvious theoretical model to be using. (I didn’t consciously realize I was making that assumption.)
Right, I agree—I was more or less taking that as a definition of honesty. However, this doesn’t mean we’d want to take it as a working definition of correctness, particularly not for WFC.
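For concreteness, here is a minimal sketch of that notion on a finite tree of counterarguments (my own toy formalization of the informal definition, with a hypothetical encoding): an argument stands iff none of its counterarguments stands.

```python
# Toy formalization (illustrative only) of "justified via arguments that
# eventually don't have any defeaters", over a finite counterargument tree.
from typing import Dict, List

def undefeated(arg: str, counters: Dict[str, List[str]]) -> bool:
    """An argument stands iff no counterargument to it stands."""
    return not any(undefeated(c, counters) for c in counters.get(arg, []))

# A is the root claim; B attacks A; C attacks B; nothing attacks C.
counters = {"A": ["B"], "B": ["C"], "C": []}
print(undefeated("A", counters))  # True: C stands, so B is defeated, so A stands
```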
Re: correctness, I think I actually misled you with my last comment; I lost track of the original point. I endorse the thing I said as a definition of what I’m usually hoping for with debate, but I don’t think that was the definition I was using here.
It sounds like you are saying you intended the first case I mentioned in my previous argument, IE:
If I suppose “correct” is close to “has an honest argument which gives enough information to convince a human” (let’s call this correct_abram), then I would buy your original argument. Yet this would do little to connect my argument to factored cognition.
Do you agree with my conclusion that your argument would, then, have little to do with factored cognition? (If so, I want to edit my first reply to you to summarize the eventual conclusion of this and other parts of the discussion, to make it easier on future readers—so I’m asking if you agree with that summary.)
To elaborate: the “correct_abram version” of WFC says, essentially, that NP-like problems (more specifically: informal questions whose answers have supporting arguments which humans can verify, though humans may also incorrectly verify wrong answers/arguments) have computation trees which humans can inductively verify.
This is at best a highly weakened version of factored cognition, and in any case it deals with a slightly different issue (namely, the problem of verifying incorrect arguments).
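As a toy formal analogue of “NP-like” in the above sense (purely illustrative; the problem and names are my own hypothetical choices): finding an answer may be hard, but a claimed answer comes with a short supporting argument—a certificate—that can be checked step by step. (The human-fallibility caveat in the parenthetical above has no analogue in this idealized checker.)

```python
# Illustrative only: subset-sum as a stand-in for an "NP-like" question.
# Finding a subset of numbers summing to a target is hard in general, but
# a claimed answer (the subset itself) is easy to verify.
from typing import List

def verify_subset_sum(numbers: List[int], target: int,
                      certificate: List[int]) -> bool:
    """Check that the certificate uses only available numbers and hits the target."""
    pool = list(numbers)
    for x in certificate:
        if x not in pool:
            return False  # the argument cites a number that isn't actually available
        pool.remove(x)
    return sum(certificate) == target

print(verify_subset_sum([3, 7, 12, 5], 15, [3, 12]))  # True
print(verify_subset_sum([3, 7, 12, 5], 15, [7, 8]))   # False
```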
I still think that working with this “definition” is an interesting theoretical exercise, though I agree it doesn’t correspond to reality. Looking back I can see that you were talking about how this “definition” doesn’t actually correspond to the realistic situation, but I didn’t realize that’s what you were saying, sorry about that.
I think you are taking this somewhat differently than I am. The fact that correct_abram doesn’t serve as a plausible notion of “correctness” (in your sense), and that honest_abram doesn’t serve as a plausible notion of “honesty” (in the sense of getting the AI system to reveal all the information it has), isn’t especially a crux for the applicability of my analysis, imho. My crux is, rather, the “no indescribably bad argument” thesis.
If bad arguments are always describably bad, then it’s plausible that some debate method could systematically avoid manipulation and perform well, even if stronger factored-cognition type theses failed. Which is the main point here.
If bad arguments are always describably bad, then it’s plausible that some debate method could systematically avoid manipulation and perform well, even if stronger factored-cognition type theses failed. Which is the main point here.
I think you also need that at least some of the time good arguments are not describably bad (i.e. they don’t have defeaters); otherwise there is no way to distinguish between good and bad arguments. (Or you need to posit some external-to-debate method of giving the AI system information about good vs bad arguments.)
Do you agree with my conclusion that your argument would, then, have little to do with factored cognition?
I think I’m still a bit confused on the relation of Factored Cognition to this comment thread, but I do agree at least that the main points we were discussing are not particularly related to Factored Cognition. (In particular, the argument that zero-sum is fine can be made without any reference to Factored Cognition.) So I think that summary seems fine.
I think you also need that at least some of the time good arguments are not describably bad
While I agree that there is a significant problem, I’m not confident I’d want to make that assumption.
As I mentioned in the other branch, I was thinking of differences in how easy lies are to find, rather than existence. It seems natural to me to assume that every individual thing does have a convincing counterargument, if we look through the space of all possible strings (not because I’m sure this is true, but because it’s the conservative assumption—I have no strong reason to think humans aren’t that hackable, even if we are less vulnerable to adversarial examples in some sense).
So my interpretation of “finding the honest equilibrium” in debate was: you enter a regime where the (honest) debate strategies are powerful enough that small mutations toward lying are defeated, because they’re not lying well.
All of this was an implicit model, not a carefully thought out position on my part. Thus, I was saying things like “50% probability the opponent finds a plausible lie” which don’t make sense as an equilibrium analysis—in true equilibrium, players would know all the plausible lies, and know their opponents knew them, etc.
But, this kind of uncertainty still makes sense for any realistic level of training.
Furthermore, one might hope that the rational-player perspective (in which the risks and rewards of lying are balanced in order to determine whether to lie) simply doesn’t apply, because in order to suddenly start lying well, a player would have to invent the whole art of lying in one gradient descent step. So, if one is sufficiently stuck in an honesty “basin”, one cannot jump over the sides, even if there are perfectly good plays which involve doing so. I offer this as the steelman of the implicit position I had.
Overall, making this argument more explicit somewhat reduces my credence in debate, because:
I was not explicitly recognizing that talk of an “honest equilibrium” relies on assumptions about misleading counterarguments not existing, as opposed to weaker assumptions about them merely being hard to find. (I think this also applies to regular debate, not just my framework here.)
Steelmanning “dishonest arguments are harder to make” as an argument about training procedures, rather than about equilibrium, seems to rest on assumptions which would be difficult to gain confidence in.
−2/+1 Scoring
It’s worth explicitly noting that this weakens my argument for the −2/+1 scoring.
I was arguing that although −2/+1 can seriously disadvantage honest strategies in some cases (as you mention, it could mean the first player lies and the second player keeps silent to avoid retribution), it fixes a problem within the would-be honest attractor basin: it cuts off otherwise problematic cases where a dishonest player can force a tie (in expectation) by continuing to argue forever.
Now, the assumptions under which this is a problem are somewhat complex (as we’ve discussed). For it to be a problem at all, I must assume there is a seeming counterargument to almost anything (at least, enough that the dishonest player can steer toward conversational territory where this is true). That means we can’t be making an argument about the equilibrium being good. Therefore, if this concern is relevant for us, we must be arguing about training rather than equilibrium behavior (in the sense I discussed above).
But if we’re arguing about training, we hopefully still have some assumption about lies being harder to find (during training). So, there should already be some other way to argue that you can’t go on dishonestly arguing forever.
So the situation would have to be pretty weird for −2/+1 to be useful.
(I don’t by any means intend to say that “a dishonest player continuing to argue in order to get a shot at not losing” isn’t a problem—just that if it’s a problem, it’s probably not a problem −2/+1 scoring can help with.)
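To make the tie-forcing intuition concrete, here is a minimal expected-value sketch (my own illustration, under the simplifying assumptions that scores are applied per debate and that each further exchange independently stands with a fixed probability—not necessarily the exact payoff structure under discussion). Under symmetric +1/−1 scoring, a 50% chance of each new argument standing makes continued argument worth 0 in expectation, i.e. a tie; under −2/+1, the same gamble is negative in expectation unless the win probability exceeds 2/3.

```python
# Illustrative arithmetic only (assumed payoffs, not the exact proposal).
def expected_value(p_win: float, win: float, lose: float) -> float:
    """Expected score of continuing to argue, given probability p_win that
    the new argument stands."""
    return p_win * win + (1 - p_win) * lose

print(expected_value(0.5, win=+1, lose=-1))  # 0.0  -> a tie in expectation
print(expected_value(0.5, win=+1, lose=-2))  # -0.5 -> continuing is penalized
```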
Yeah all of this makes sense to me; I agree that you could make an argument about the difference in difficulty of finding defeaters to good vs. bad arguments, and that could then be used to say “debate will in practice lead to honest policies”.