My understanding is that each debater can actually reveal many pixels to the judge. See this quote from section 3.2:
That sounds different to me—the point there is that, because you only need a single pixel to catch me in a lie, and any such demonstration of my dishonesty will result in your win, your limit won’t be a true limit; either you can demonstrate a single pixel where we disagree, which should be enough to establish you as the honest player, or we agree about every pixel (in which case the truth should win).
The situation that the liar would hope to establish is that they successfully tell many small lies. If I say a pixel is medium gray, and you think the pixel is light gray, I can get away with calling it medium gray, because revealing that single pixel to the judge (who, absent context, can’t determine whether a pixel is ‘light’ or ‘medium’ in the relevant sense) won’t give the judge enough evidence that I’m lying to settle the debate. Especially since you might be worried that I’m going to tell an escalating series of lies, so that if you spend a reveal calling me out on claiming that the pixel is medium gray instead of light gray, you won’t be able to call me out later on claiming that a different pixel is black when it is in fact white. (This also means, interestingly, that the player who tries to reason about feature patches—like the diagonal pattern of a whisker—is potentially under more suspicion than a player who tries to reason about particular pixels, since it’s easier to lie about contrasts (which can’t be disproven with a single reveal) than it is to lie about pixels.)
Also consider the case of adversarial examples; if I take the reference image, find the minimal infinity-norm perturbation that results in an image of a different class, and then argue with reference to my perturbed image, presumably there’s no one pixel we disagree about strongly (because the pixel we disagree about most strongly determines the infinity norm that I tried to minimize), and thus it’s hard to establish a blatant lie. (A rough sketch of this construction appears after the next paragraph.)
Especially if we aren’t allowed to talk about RGB values, and instead have to mention subjective colors; in one game that I played, one of the players revealed a “pupil” pixel that was about #404040. This is weak evidence against its being a pupil—you’d expect a pupil pixel to be somewhat darker, since #404040 is gray level 64 out of 255—but the player hadn’t lied at all about that specific pixel’s color, just about the interpretation of the color, which the other player demonstrated by revealing that an adjacent “true pupil” pixel was about #080808 (or maybe even full black).
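To make the adversarial-example construction above concrete, here is a minimal sketch: projected gradient descent under an L-infinity constraint, in PyTorch. The model, epsilon range, and step counts are placeholder assumptions of mine, not anything from the paper; the point is just that the projection step caps how strongly any single pixel can disagree with the reference image.

```python
import torch
import torch.nn.functional as F

def linf_attack(model, x, label, eps, steps=40):
    """PGD inside an L-infinity ball of radius eps: every pixel of the
    perturbed image stays within eps of the reference, so there is no
    single pixel on which the two images strongly disagree."""
    alpha = 2.5 * eps / steps                 # common step-size heuristic
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), label)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()           # push toward misclassification
            x_adv = torch.clamp(x_adv, x - eps, x + eps)  # project back into the ball
            x_adv = torch.clamp(x_adv, 0.0, 1.0)          # stay a valid image
    return x_adv

def minimal_eps(model, x, label, lo=0.0, hi=0.5, iters=10):
    """Binary-search the smallest eps that flips the label
    (assumes flipping is roughly monotone in eps)."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        flipped = model(linf_attack(model, x, label, mid)).argmax(1) != label
        lo, hi = (lo, mid) if flipped.all() else (mid, hi)
    return hi
```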
Now, perhaps ‘any lie at all’ is enough to establish who’s the honest party and who’s the liar. But this seems not at all useful in the case of uncertainty, and also not at all useful in the case of pernicious disagreement, where I disagree about no low-level features but dispute all inferences that could be drawn from those features.
Especially if we aren’t allowed to talk about RGB values, and instead have to mention subjective colors;
I assume we are allowed to talk about RGB values, because in the actual AI debates there is no legitimate reason for the AI debaters to talk about subjective impressions. They should always just talk about objective measurements or clear external facts (like what a certain sentence on a certain web page says). If a debater tries to talk about subjective impressions, the judge can just rule against that debater (since, again, there seems to be no legitimate reason to do that), and the AIs will learn not to do it.
Also consider the case of adversarial examples; if I take the reference image, find the minimal infinity-norm perturbation that results in an image of a different class, and then argue with reference to my perturbed image, presumably there’s no one pixel we disagree about strongly (because the pixel we disagree about most strongly determines the infinity norm that I tried to minimize), and thus it’s hard to establish a blatant lie.
If we can talk about RGB values, we don’t need to establish a lie based on a single pixel. The honest debater can give a whole bunch of RGB pixel values, which, even if it doesn’t conclusively establish a lie, will make the truth-telling strategy have a higher winning probability, and that would be enough to make both debaters converge to telling the truth during training.
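As a toy illustration of how many individually weak reveals compound (my own model, not anything from the paper): suppose the liar has to shift every claimed gray value by a small delta to support the wrong class, and the judge scores each claim against the revealed value under Gaussian perceptual noise. No single pixel settles anything, but the per-pixel log-likelihood ratios add up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (my assumption): the liar shifts every claimed gray value by
# delta, and the judge scores claims under perceptual noise sigma.
sigma, delta, n = 12.0, 6.0, 25
revealed = rng.uniform(0, 255, n)   # ground-truth values, as revealed
liar_claims = revealed + delta      # each claim is only slightly off

def log_odds_lying(claims, revealed):
    # Per-pixel log-likelihood ratio P(claim | liar) / P(claim | honest),
    # modeling both hypotheses as Gaussians around the revealed value.
    ll_honest = -((claims - revealed) ** 2) / (2 * sigma ** 2)
    ll_liar = -((claims - revealed - delta) ** 2) / (2 * sigma ** 2)
    return (ll_liar - ll_honest).sum()

print(log_odds_lying(liar_claims[:1], revealed[:1]))  # ~0.13: one pixel settles nothing
print(log_odds_lying(liar_claims, revealed))          # ~3.1: ~22:1 odds across 25 pixels
```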
But this seems not at all useful in the case of uncertainty, and also not at all useful in the case of pernicious disagreement, where I disagree about no low-level features but dispute all inferences that could be drawn from those features.
Not sure I understand the part about uncertainty. About disputing inferences, it also seems to me that the judge needs to have enough domain expertise to judge the validity of the inferences being disputed. In some cases the honest debater may be able to win by educating the judge (e.g., by pointing to a relevant section in a textbook). In other cases this may not be possible, and I’m not sure what the solution is there.
ETA: The authors talk about this and related issues in section 5.3, with the following conclusions:
The complexity theoretic analogy suggests that these difficulties can be overcome by a sufficiently sophisticated judge under simple conditions. But that result may not hold up when AI systems need to use powerful but informal reasoning, or if humans cannot formalize their criteria for judgment. We are optimistic that we can learn a great deal about these issues by conducting debates between humans, in domains where experts have much more time than the judge, have access to a large amount of external information, or have expertise that the judge lacks.
The honest debater can give a whole bunch of RGB pixel values, which, even if it doesn’t conclusively establish a lie, will make the truth-telling strategy have a higher winning probability, and that would be enough to make both debaters converge to telling the truth during training.
One thing that I find myself optimizing for is compression (which seems like a legitimate reason for the actual AI debates to talk about subjective impressions as opposed to objective measurements). It seems to me that if the debaters both just provide the judge with the whole image using natural language, then the honest debater is sure to win (both of them provide the judge with an image, each tells the judge one pixel of the other’s description to check, and the honest debater correctly identifies a fake pixel while being immune to a similar attack from the dishonest debater). But this only makes sense if talk with the debaters is cheap and external validation is expensive, which is not the case in the real use cases, where the judge’s time evaluating arguments is expensive (or the debaters have more subject-matter expertise than the judge, such that just giving the judge raw pixel values is not enough for the judge to correctly classify the image).
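Here is a toy version of that whole-image protocol (the sizes and values are illustrative assumptions of mine, and I assume, as the argument above does, that the honest challenger knows where the liar’s description deviates):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy whole-image protocol (sizes and values are illustrative assumptions).
H, W = 28, 28
image = rng.integers(0, 256, size=(H, W))   # ground truth the judge can probe

honest_desc = image.copy()                  # honest description: exact
liar_desc = image.copy()
liar_desc[12, 7] = (liar_desc[12, 7] + 120) % 256   # any lie must differ somewhere

# Each debater names one pixel of the opponent's description for the judge
# to check against the real image.
honest_challenge = np.unravel_index(np.abs(liar_desc - image).argmax(), (H, W))
liar_challenge = (3, 3)   # hopeless: the honest description has no wrong pixel

def pixel_checks_out(desc, pixel):
    return desc[pixel] == image[pixel]

print(pixel_checks_out(liar_desc, honest_challenge))  # False -> liar caught
print(pixel_checks_out(honest_desc, liar_challenge))  # True  -> honest player survives
```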
Not sure I understand the part about uncertainty.
Most of my discussion is about the cat vs. dog debate game with humans, where it’s assumed that the two debaters both know what the ground truth is (and the judge could verify it, given the image). But perhaps it is difficult for the debaters to discern the actual ground truth, and there is an honest disagreement—that is, both debaters agree on every pixel value, but think those pixel values add up to different classifications. (The ambiguous cat-dog drawing is such an example for the image classification problem, and one can imagine different classifiers that classify such a drawing differently. Or, with regular images, different classifiers may make different random errors due to incomplete training.) Such honest disagreement is what I mean by ‘uncertainty.’ Ideally, in such a system, the debate will quickly focus the judge’s attention on the core crux and allow them to quickly settle the issue (or determine that it isn’t possible to settle with the information available).
(In the case of advanced intelligence, consider the case where the AI system is proposing a drug for human consumption, where some of its models of human preferences and physiology think that the drug would be net good, and other models think it would be net bad. It seems like a debate-style model would be good at exposing the core disagreements to human supervisors, but that it is highly unlikely that those disagreements could be resolved by the equivalent of checking a single pixel.)
But perhaps it is difficult for the debaters to discern the actual ground truth
I think in those cases the debaters are supposed to give probabilistic answers and support them with probabilistic arguments. The paper talks about this a bit, but not enough to give me a good idea of what those kinds of debates would actually look like (which was one of my complaints about the paper).