Especially if we aren’t allowed to talk about RGB values, and instead have to mention subjective colors;
I assume we are allowed to talk about RGB values because in the actual AI debates, there is no legitimate reason for the AI debaters to talk about subjective impressions. They should always just talk about objective measurements or clear external facts (like what a certain sentence on a certain web page says). If a debater tries to talk about subjective impressions, the judge can just rule against that debater (since, again, there seems to be no legitimate reason to do so), and then the AIs will learn not to do that.
Also consider the case of adversarial examples; if I take the reference image, determine the minimal infinity norm perturbation that results in an image of a different class, and then argue with reference to my image, presumably there’s no one pixel we disagree about strongly (because the pixel we disagree about most strongly determines the infinity norm that I tried to minimize), and thus it’s hard to establish a blatant lie.
If we can talk about RGB values, we don’t need to establish a lie based on a single pixel. The honest debater can give a whole bunch of RGB pixel values; even if that doesn’t conclusively establish a lie, it will give the truth-telling strategy a higher winning probability, which would be enough to make both debaters converge to telling the truth during training.
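To make the intuition concrete, here is a toy simulation (entirely my own construction, with made-up parameters: a 28x28 image, a dishonest debater committed to a small L-infinity perturbation, and a judge who can verify a few stated pixel values against ground truth):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a 28x28 grayscale "true" image, and a dishonest
# debater committed to an adversarially perturbed copy whose per-pixel
# changes are all at most eps (small L-infinity norm).
true_image = rng.integers(0, 256, size=(28, 28))
eps = 3
perturbation = rng.integers(-eps, eps + 1, size=true_image.shape)
dishonest_image = np.clip(true_image + perturbation, 0, 255)

def lie_detected(n_checks: int) -> bool:
    """The judge spot-checks n_checks random pixels against ground truth.
    The honest debater's stated values always match; the dishonest
    debater is caught if any checked pixel differs at all."""
    rows = rng.integers(0, 28, size=n_checks)
    cols = rng.integers(0, 28, size=n_checks)
    return bool(np.any(dishonest_image[rows, cols] != true_image[rows, cols]))

# No single pixel is off by more than eps, yet the lie is exposed with
# high probability once a handful of stated RGB values can be verified.
for n in (1, 5, 20):
    caught = np.mean([lie_detected(n) for _ in range(10_000)])
    print(f"{n} checks: lie caught with probability ~{caught:.2f}")
```

Even though no single pixel is a blatant lie, the small discrepancies are spread over many pixels, so each verified value is another chance to catch the dishonest debater.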
But this seems not at all useful in the case of uncertainty, and also not at all useful in the case of pernicious disagreement, where I disagree about no low-level features but dispute all inferences that could be drawn from those features.
Not sure I understand the part about uncertainty. About disputing inferences, it also seems to me that the judge needs enough domain expertise to judge the validity of the inferences being disputed. In some cases the honest debater may be able to win by educating the judge (e.g., by pointing to a relevant section in a textbook). In other cases this may not be possible, and I’m not sure what the solution is there.
ETA: The authors talk about this and related issues in section 5.3, with the following conclusions:
The complexity theoretic analogy suggests that these difficulties can be overcome by a sufficiently sophisticated judge under simple conditions. But that result may not hold up when AI systems need to use powerful but informal reasoning, or if humans cannot formalize their criteria for judgment. We are optimistic that we can learn a great deal about these issues by conducting debates between humans, in domains where experts have much more time than the judge, have access to a large amount of external information, or have expertise that the judge lacks.
The honest debater can give a whole bunch of RGB pixel values; even if that doesn’t conclusively establish a lie, it will give the truth-telling strategy a higher winning probability, which would be enough to make both debaters converge to telling the truth during training.
One thing that I find myself optimizing for is compression (which does seem like a legitimate reason for the actual AI debates to talk about subjective impressions as opposed to objective measurements). It seems to me that if the debaters both just provide the judge with the whole image using natural language, then the honest debater is sure to win: both of them provide the judge with an image, each of them tells the judge one pixel of the other’s image to check, and the honest debater correctly identifies a fake pixel while being immune to a similar attack from the dishonest debater. But this only makes sense if talk with the debaters is cheap and external validation is expensive, which is not the case in the real use cases, where the judge’s time evaluating arguments is expensive (or the debaters have more subject-matter expertise than the judge, such that just giving the judge raw pixel values is not enough for the judge to correctly classify the image).
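A minimal sketch of that cheap-talk protocol, under the toy assumptions that both debaters know the true image and that the dishonest debater must misreport at least one pixel to support the wrong class:

```python
import numpy as np

rng = np.random.default_rng(1)
true_image = rng.integers(0, 256, size=(28, 28))

# The honest debater submits the true image; the dishonest debater must
# alter at least one pixel to support the wrong classification.
honest_claim = true_image.copy()
dishonest_claim = true_image.copy()
dishonest_claim[12, 7] = (dishonest_claim[12, 7] + 40) % 256  # the lie

def flag(opponent_claim: np.ndarray) -> tuple:
    """Point at the opponent-claimed pixel that deviates most from the
    truth (in this toy setup both debaters know the true image)."""
    diff = np.abs(opponent_claim - true_image)
    return np.unravel_index(np.argmax(diff), diff.shape)

# Each debater names one pixel of the other's image; the judge checks
# both named pixels against the real image.
honest_flag = flag(dishonest_claim)    # lands on the altered pixel
dishonest_flag = flag(honest_claim)    # no genuine discrepancy to point at

print(dishonest_claim[honest_flag] != true_image[honest_flag])     # True
print(honest_claim[dishonest_flag] != true_image[dishonest_flag])  # False
```

The honest debater wins every round of this game, but only because the judge can absorb two full image claims and do a pixel lookup for free, which is exactly the cheap-talk assumption that breaks down in the real use cases.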
Not sure I understand the part about uncertainty.
Most of my discussion is about the cat vs. dog debate game with humans, where it’s assumed that the two debaters both know what the ground truth is (and the judge could verify it, given the image). But perhaps it is difficult for the debaters to discern the actual ground truth, and there is an honest disagreement: both debaters agree on every pixel value, but think those pixel values add up to different classifications. (The ambiguous cat-dog drawing is such an example for the image classification problem, and one can imagine different classifiers that classify such a drawing differently. Or, with regular images, different classifiers may make different random errors due to incomplete training.) Such honest disagreement is what I mean by ‘uncertainty.’ Ideally, in such a system, the debate will quickly focus the judge’s attention on the core crux and allow them to settle the issue (or determine that it isn’t possible to settle with the information available).
(In the case of advanced intelligence, consider an AI system proposing a drug for human consumption, where some of its models of human preferences and physiology think the drug would be net good, and other models think it would be net bad. It seems like a debate-style model would be good at exposing the core disagreements to human supervisors, but it is highly unlikely that those disagreements could be resolved by the equivalent of checking a single pixel.)
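As a toy picture of what “agreeing on every pixel but disputing the inference” looks like, imagine two debaters whose internal models weight the same agreed-on features differently; the numbers below are invented purely for illustration:

```python
import numpy as np

# Agreed-on low-level facts (no pixel-level dispute at all).
features = np.array([0.9, 0.2, 0.7, 0.4])

# The debaters' internal models weight those facts differently.
weights_a = np.array([0.8, 0.1, -0.3, 0.5])   # debater A's model
weights_b = np.array([-0.4, 0.1, -0.3, 0.5])  # debater B's model

# Same evidence, opposite classifications.
print(features @ weights_a > 0, features @ weights_b > 0)  # True False

# A good debate should steer the judge straight to the term where the
# two models diverge most (the core crux), rather than relitigating
# the undisputed features themselves.
crux = int(np.argmax(np.abs(features * (weights_a - weights_b))))
print(f"core disagreement: feature {crux}")  # feature 0
```

Checking a single “pixel” (feature) settles nothing here; what needs adjudicating is the weight given to it, which is exactly the kind of disagreement that won’t resolve with one verified fact.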
But perhaps it is difficult for the debaters to discern the actual ground truth
I think in those cases the debaters are supposed to give probabilistic answers and support them with probabilistic arguments. The paper talks about this a bit, but not enough to give me a good idea of what those kinds of debates would actually look like (which was one of my complaints about the paper).
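The paper doesn’t spell out the mechanism, but one natural reading is that the judge scores probabilistic claims with a proper scoring rule, under which reporting your true credence is the best strategy. A sketch of that idea (my construction, not the paper’s):

```python
import math

def log_score(p_claimed: float, outcome: bool) -> float:
    """Proper scoring rule: p_claimed is the debater's stated
    probability that the outcome turns out true."""
    p = p_claimed if outcome else 1.0 - p_claimed
    return math.log(max(p, 1e-12))

# Suppose the ground truth is genuinely uncertain and resolves True
# with probability 0.7. Expected score for a debater reporting q:
def expected_score(q: float) -> float:
    return 0.7 * log_score(q, True) + 0.3 * log_score(q, False)

for q in (0.7, 0.99, 0.5):
    print(f"report {q}: expected score {expected_score(q):.3f}")
# Reporting the true credence (0.7) maximizes the expected score;
# overclaiming certainty (0.99) is penalized in expectation.
```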