The honest debater can give a whole bunch of RGB pixel values. Even if those values don’t conclusively establish a lie, they give the truth-telling strategy a higher probability of winning, which would be enough to make both debaters converge to telling the truth during training.
One thing that I find myself optimizing for is compression (which seems like a legitimate reason for the actual AI debates to talk about subjective impressions as opposed to objective measurements). It seems to me that if both debaters just provide the judge with the whole image in natural language, then the honest debater is sure to win: each debater gives the judge an image, each tells the judge one pixel to check in the other’s image, and the honest debater correctly identifies a fake pixel while being immune to the same attack from the dishonest debater. But this only makes sense if talking with the debaters is cheap and external validation is expensive, which is not the case in real use cases, where the judge’s time spent evaluating arguments is expensive (or where the debaters have more subject-matter expertise than the judge, such that raw pixel values alone are not enough for the judge to classify the image correctly).
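To make the whole-image protocol concrete, here is a minimal Python sketch of it as I understand it. Everything in it is my own toy setup rather than anything from the paper: the names `first_mismatch`, `judge`, and `run_debate` are hypothetical, the image is a flat list of RGB tuples, the dishonest debater picks its challenge pixel at random, and the judge is assumed to be able to verify one pixel per claimed image against the real image.

```python
import random

Pixel = tuple[int, int, int]  # (R, G, B), each in 0..255


def first_mismatch(mine: list[Pixel], theirs: list[Pixel]) -> int:
    """Index of the first pixel where two claimed images differ (0 if none)."""
    for i, (a, b) in enumerate(zip(mine, theirs)):
        if a != b:
            return i
    return 0


def judge(true_image, claim_a, claim_b, a_checks: int, b_checks: int) -> str:
    """The judge verifies exactly one pixel in each claimed image.

    a_checks: index that debater A asks the judge to verify in B's claim.
    b_checks: index that debater B asks the judge to verify in A's claim.
    """
    b_caught = claim_b[a_checks] != true_image[a_checks]
    a_caught = claim_a[b_checks] != true_image[b_checks]
    if b_caught and not a_caught:
        return "A"
    if a_caught and not b_caught:
        return "B"
    return "undecided"


def run_debate(true_image, fake_image, rng) -> str:
    """Debater A is honest (claims the true image); debater B claims fake_image."""
    honest_claim = list(true_image)
    dishonest_claim = list(fake_image)
    # The honest debater points at a pixel where the opponent's claim is wrong.
    honest_checks = first_mismatch(honest_claim, dishonest_claim)
    # Assumption: the dishonest debater guesses; no index can expose a true claim.
    dishonest_checks = rng.randrange(len(true_image))
    return judge(true_image, honest_claim, dishonest_claim,
                 honest_checks, dishonest_checks)


rng = random.Random(0)
true_image = [(rng.randrange(256), rng.randrange(256), rng.randrange(256))
              for _ in range(64)]
fake_image = list(true_image)
r, g, b = fake_image[17]
fake_image[17] = ((r + 1) % 256, g, b)  # a single altered pixel

print(run_debate(true_image, fake_image, rng))  # prints "A"
```

The sketch also shows where the cost goes: the judge does only two pixel lookups, but both debaters have to transmit the entire image, which is exactly the part that stops being cheap in realistic settings.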
Not sure I understand the part about uncertainty.
Most of my discussion is about the cat vs. dog debate game with humans, where it’s assumed that the two debaters both know what the ground truth is (and the judge could verify it, given the image). But perhaps it is difficult for the debaters to discern the actual ground truth, and there is an honest disagreement—that is, both debaters agree on every pixel value, but think those pixel values add up to different classifications. (The ambiguous cat-dog drawing is such an example for the image classification problem, and one can imagine different classifiers that classify such a drawing differently. Or, with regular images, different classifiers may make different random errors due to incomplete training.) Such honest disagreement is what I mean by ‘uncertainty.’ Ideally, in such a system, the debate will quickly focus the judge’s attention on the core crux and allow them to quickly settle the issue (or determine that it isn’t possible to settle with the information available).
(In the case of advanced intelligence, consider the case where the AI system is proposing a drug for human consumption, where some of its models of human preferences and physiology think that the drug would be net good, and other models think it would be net bad. It seems like a debate-style model would be good at exposing the core disagreements to human supervisors, but that it is highly unlikely that those disagreements could be resolved by the equivalent of checking a single pixel.)
But perhaps it is difficult for the debaters to discern the actual ground truth
I think in those cases the debaters are supposed to give probabilistic answers and support them with probabilistic arguments. The paper talks about this a bit, but not enough to give me a good idea of what those kinds of debates would actually look like (which was one of my complaints about the paper).