I made this link post so there would be a good place for the following confusion of mine:
The setup of the paper is that a judge who does not have access to a test passage tries to answer questions about it. The debate results are compared to a human consultancy baseline, where a person who does have access to the text tries to convince the judge of a randomly chosen answer (so the assigned answer is correct 50% of the time and incorrect 50% of the time).
The baseline strategy for a deceptive consultant (one assigned to convince the judge of the wrong answer to a question) in this situation is to simply refuse to answer any questions, forcing the judge to guess at random. This guarantees a 50% success rate at deceiving the judge.
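To spell out that arithmetic, here's a minimal simulation sketch (Python, with an arbitrary trial count; my own toy model of the setup, not code from the paper):

```python
import random

TRIALS = 100_000  # arbitrary; just enough for the rate to converge

# The judge hears nothing, so they guess uniformly between the two answers.
# The dishonest consultant "wins" whenever that guess lands on the
# assigned wrong answer.
wins = sum(
    random.choice(["right", "wrong"]) == "wrong"
    for _ in range(TRIALS)
)
print(f"Deception success rate: {wins / TRIALS:.3f}")  # ~0.500
```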
However, the paper says:
"Dishonest human consultants successfully deceive judges 40% of the time."
This seems crazily low to me. How can it be that consultants, who are the only people with access to the text, fail to deceive the judge even 50% of the time, when the simple baseline of not answering any questions guarantees a 50% success rate?
Julian (the primary author) clarifies on Twitter:
"Ah maybe we weren't clear: The judge can see which answer the consultant was assigned to, but doesn't know if they're honest. If the consultant refused to answer any questions then they would immediately out themselves as dishonest."
That clarification makes this make sense. I'm still surprised by how low 40% is, but at least I can't think of a simple dominant strategy in this scenario.
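To see why the silence strategy collapses under the clarified rules, here's the same sketch extended with a judge who sees the assigned answer. The judge policy here is my own assumption for illustration (read silence as a tell), not anything measured in the paper:

```python
import random

def silent_consultant_success(judge_sees_assignment: bool, trials: int = 100_000) -> float:
    """Deception rate for a dishonest consultant who refuses to answer.

    Toy model, not the paper's experiment. If the judge sees which
    answer the consultant was assigned, we assume they read silence
    as dishonesty and pick the other answer.
    """
    wins = 0
    for _ in range(trials):
        if judge_sees_assignment:
            judge_pick = "right"  # silence outs the consultant
        else:
            judge_pick = random.choice(["right", "wrong"])  # blind guess
        wins += judge_pick == "wrong"  # consultant argued for the wrong answer
    return wins / trials

print(silent_consultant_success(False))  # ~0.5: the naive baseline above
print(silent_consultant_success(True))   # 0.0: silence no longer works
```

So the dishonest consultant actually has to engage with the judge's questions, which is presumably the harder game in which they only win 40% of the time.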