This forces the dishonest debater either to commit to all the details of their argument ahead of time (in which case the honest debater can focus on the flaw), or to answer questions inconsistently (in which case the honest debater can exhibit this inconsistency to the judge).
Can we construct situations (or do they naturally arise) where the honest debater has similar problems? I think your strategy of showing context probably rules this out: even if there are multiple correct arguments for the true position that rely on overloaded terms (so that x means 4 in one situation and 2 in another, because sometimes you’ve squared it and sometimes you haven’t), in context it’ll be clear what each term refers to and how the argument is meant to work.
Then A* guesses which definition came from B*. If they guess correctly, team A wins.
What happens if the answers are the same, or only contain stylistic differences? What happens if they guess incorrectly? Is only one cross-examination allowed per team game?
Even in the presence of a penalty for guessing incorrectly (like, say, losing if you called a cross-examination and your partner couldn’t identify which answer came from the other debate), the dishonest team might want to call one as soon as they think the chance of them convincing a judge is below 50%, because that’s the worst-case win-rate from blind guessing.
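To make the arithmetic concrete, here’s a minimal sketch of the decision rule I have in mind (the names are mine, and the 0.5 floor assumes exactly two candidate answers and blind guessing):

```python
def should_call_cross_examination(p_convince_judge: float,
                                  p_partner_guess: float = 0.5) -> bool:
    """Dishonest team's decision rule under the stated assumptions.

    p_convince_judge: estimated chance of winning by continuing to debate.
    p_partner_guess:  chance the partner picks the right answer; 0.5 is
                      the blind-guessing floor with two candidate answers.
    """
    return p_partner_guess > p_convince_judge

assert should_call_cross_examination(0.4)      # 0.5 > 0.4: call it
assert not should_call_cross_examination(0.6)  # 0.5 < 0.6: keep debating
```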
Supposing identical answers null out the cross-examination, you still have the problem of information leakage (because the team that calls the cross-examination gets to share more context than the team that doesn’t). This suggests that as soon as I have put myself into a pickle where there are two conflicting answers and I want to communicate to my partner that they should go left instead of right, I should call a cross-examination (so that my partner can read my context and hopefully come to the same conclusion).
And if there’s only one cross-examination allowed, because of the information leakage, then the dishonest team might want to do it as soon as possible to take the option off the table for the honest team.
On the other hand, it might be problematic for ML training if the judge signal only prefers completely honest play to dishonest play, and doesn’t reliably reward being less dishonest. The lack of a gradient towards more honest play may make it difficult to learn winning honest strategies.
Supposing both players are dishonest, and a substantial fraction of debates end because someone successfully catches the other person in a lie, then the player with a higher rate of blatant dishonesty is disfavored because their first catchable lie occurs sooner (in expectation). I think this gets you a gradient in the direction you need, but depends a lot on how much the judges reward being a correct stickler on irrelevant points (as soon as B realizes A has them in a vise on a particular point, B will move to arguing that the point is irrelevant).
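As a sanity check on the “first catchable lie” claim, here’s a toy simulation (the lose-on-first-catchable-lie rule, the per-round independence, and the even split of ties are all simplifying assumptions of mine):

```python
import random

def first_catchable_lie(p_lie: float, rounds: int = 20):
    """Round of the first blatant lie, or None if none occurs; roughly
    geometric, so it arrives around round 1/p_lie in expectation."""
    for t in range(1, rounds + 1):
        if random.random() < p_lie:
            return t
    return None

def less_dishonest_win_rate(p_a: float, p_b: float, trials: int = 100_000) -> float:
    """Fraction of debates won by A (the less blatantly dishonest player),
    assuming the loser is whoever tells a catchable lie first and that
    ties / lie-free debates split evenly."""
    wins = 0.0
    for _ in range(trials):
        a = first_catchable_lie(p_a) or float("inf")
        b = first_catchable_lie(p_b) or float("inf")
        if b < a:
            wins += 1
        elif a == b:
            wins += 0.5
    return wins / trials

print(less_dishonest_win_rate(0.1, 0.3))  # ~0.77: A's first lie comes later
```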
the dishonest team might want to call one as soon as they think the chance of them convincing a judge is below 50%, because that’s the worst-case win-rate from blind guessing
I also think this is a fatal flaw with the existing two-person-team proposal; you need a system that gives you only an epsilon chance of winning if you’re invoking it spuriously.
I have what looks to me like an improvement, but there’s still a vulnerability:
A challenges B by giving a yes-no question, as well as a previous round at which to ask it. B answers; B* answers based on B’s notes up to that point; A wins outright if B and B* answer differently.
(This has the side effect that A* doesn’t need to be involved, and so B can later challenge A. But of course we could get this under any such proposal by having teams larger than two!)
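To pin down the mechanics, here’s a minimal sketch of the challenge as I read it (the Agent type, the note-passing, and the verdict strings are all stand-ins, not part of the proposal):

```python
from typing import Callable

# Hypothetical type: an Agent maps (question, notes) -> yes/no, where
# "notes" stands in for whatever private state the debater has written
# down by a given round.
Agent = Callable[[str, str], bool]

def run_challenge(question: str,
                  notes_up_to_round: str,
                  full_notes: str,
                  b: Agent,
                  b_star: Agent) -> str:
    """A's challenge as proposed above: B answers with everything it
    knows; B* answers from B's notes up to the challenged round only.
    Any disagreement hands A the win outright."""
    if b(question, full_notes) != b_star(question, notes_up_to_round):
        return "A wins"
    return "debate continues"
```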
The remaining vulnerability is that A could ask a question that is so abstruse (and irrelevant to the actual debate) that there’s a good chance an honest B and B* will answer it differently. (I’m thinking of more sophisticated versions of “if a tree falls in the forest” questions.)
This has the side effect that A* doesn’t need to be involved
I thought the thing A* was doing was giving a measure of “answer differently” that was more reliable than something like string comparison. If B’s answer is “dog” and B*’s is “canine”, then hopefully those get counted as “the same” in situations where the difference is irrelevant and as “different” in situations where the difference is relevant. If everything can be yes/no, then I agree this doesn’t lose you much, but I think it reduces the amount of trickery you can detect.
That is, imagine one of those games where I’m thinking of a word, and you have to guess it, and you can ask questions that narrow down the space of possible words. One thing I could do is change my word whenever I think you’re getting close, but I have to do so to a different word that has all the properties I’ve revealed so far. (Or, like, each time I could answer in the way that leaves me with the largest set of possible words left, maximizing the time-to-completion.) If we do the thing where B says the word, and B* gets to look at B’s notes up to point X and say B’s word, then the only good strategy for B is to have the word in their notes (and be constrained by it), but this check is resistant to being reduced to a yes/no question. (Even if the question is something like “is there a word in B’s notes?”, B* might be able to tell that B will say “yes” even though there isn’t a word in B’s notes; maybe because B wrote “hey, I’m doing the strategy where I don’t have a word, to be slippery, but pretend that I do have a word if asked” in the notes.)
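For concreteness, that slippery answering strategy looks something like this sketch (the classic “Evil Hangman” trick; names are mine):

```python
def slippery_answer(candidates: set[str], predicate) -> tuple[bool, set[str]]:
    """Answer a yes/no question about 'my word' so as to keep the largest
    consistent candidate set -- without ever committing to a word."""
    yes = {w for w in candidates if predicate(w)}
    no = candidates - yes
    return (True, yes) if len(yes) >= len(no) else (False, no)

# "Does your word contain an 'e'?"
words = {"dog", "cat", "horse", "mouse", "goose"}
answer, words = slippery_answer(words, lambda w: "e" in w)
print(answer, words)  # True {'horse', 'mouse', 'goose'} -- no word was ever picked
```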
A wins outright if B and B* answer differently.
As stated, I think this has a bigger vulnerability: B and B* just always answer the question with “yes.” One nice thing about yes/no questions is that maybe you can randomly flip them (so one copy gets “does B think the animal is a dog?” and the other gets “does B think the animal is not a dog?”), so there’s no preferred orientation; that would defeat the “always say yes” strategy unless the question-flipping is detectable. (Since A is the one asking the question, A might limit themselves to easily reversible questions, but this constrains their ability to clamp down on trickery.)
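Here’s a sketch of that flipping mechanic, assuming a reliable negate() helper exists (which is exactly the “easily reversible questions” constraint):

```python
import random

def negate(question: str) -> str:
    """Hypothetical flipper; only works on questions whose polarity is
    easy to reverse, which is the constraint noted above."""
    return question.replace(" is ", " is not ", 1)

def ask_with_random_flip(question: str, responder) -> bool:
    """Show the responder a randomly flipped version of the question,
    then un-flip the reply back into the original orientation."""
    flipped = random.random() < 0.5
    reply = responder(negate(question) if flipped else question)
    return (not reply) if flipped else reply

always_yes = lambda q: True  # the degenerate B / B* strategy

# B and B* get independent flips, so their un-flipped answers disagree on
# about half of all challenges -- "always say yes" stops being safe.
q = "does B think the animal is a dog?"
trials = 10_000
disagreements = sum(
    ask_with_random_flip(q, always_yes) != ask_with_random_flip(q, always_yes)
    for _ in range(trials)
)
print(disagreements / trials)  # ~0.5
```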
As stated, I think this has a bigger vulnerability: B and B* just always answer the question with “yes.”
Remember that this is also used to advance the argument. If A thinks B has such a strategy, A can ask the question in such a way that B’s “yes” helps A’s argument. But sure, there is something weird here.