Rohin argued that this is not the case, because every debate is ultimately only about the plausibility of the original statement, independent of the number of subcomponents it logically breaks down into (or at least that’s how I understood him).
No, that’s not what I mean.
The idea with debate is that you can have justified belief in some claim X if you see one expert (the “proponent”) agree with claim X, and another equally capable expert (the “antagonist”) who is solely focused on defeating the first expert is unable to show a problem with claim X. The hope is that the antagonist fails in its task when X is true, and succeeds when X is false.
We only give the antagonist one try at showing a problem with claim X. If the support for the claim breaks down into two necessary subcomponents, the antagonist should choose the one that is most problematic; it doesn’t get to backtrack and talk about the other subcomponent.
This does mean that the judge would not be able to tell you why the other subcomponent is true, but the fact that the antagonist didn’t choose to talk about that subcomponent suggests that the human judge would find that subcomponent more trustworthy than the one the antagonist did choose to talk about.
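To make the no-backtracking structure concrete, here is a minimal sketch (my own illustration, not from the debate paper; `Claim`, `antagonist_pick`, and `judge` are hypothetical placeholders for the argument tree, the trained antagonist, and the human judge):

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    statement: str
    # Necessary subcomponents supporting this claim; empty for a leaf claim
    # the judge can evaluate directly.
    subclaims: list["Claim"] = field(default_factory=list)


def run_debate(root: Claim, antagonist_pick, judge) -> bool:
    """Judge a claim after a single debate path.

    At every step the antagonist commits to the one subclaim it considers
    most problematic and never revisits the siblings it passed over, so the
    transcript is one root-to-leaf path even if the full tree of subclaims
    is exponentially large.
    """
    transcript = [root.statement]
    node = root
    while node.subclaims:
        node = antagonist_pick(node.subclaims)  # one shot, no backtracking
        transcript.append(node.statement)
    # The judge only ever has to evaluate the leaf the chosen path ends on.
    return judge(transcript, node)
```

The point of the sketch is just the control flow: the antagonist’s forced choice is what keeps the transcript linear in the depth of the argument rather than in its total size.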
I feel like whether the debater is truthful or not only depends on whether they would be rewarded for being so. However, I (currently) don’t see strong reasons for the debater to always be truthful.
I mean, the reason is “if the debater is not truthful, the opponent will point that out, and the debater will lose”. This in turn depends on the central claim in the debate paper:
Claim. In the debate game, it is harder to lie than to refute a lie.
In cases where this claim isn’t true, I agree debate won’t get you the truth. I agree that in the “flawed physics” example, if you have a short debate, then deception is incentivized.
As I mentioned in the previous comment, I do think deception is a problem you would worry about, but only in cases where it is easier to lie than to refute the lie. I think it is inaccurate to summarize this as “debate assumes that AI is not deceptive”; the assumption is much more specific: “it is harder to lie than to refute a lie” (which sounds far more plausible to me, at least, than “assumes that AI is not deceptive”).
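A toy illustration (my own, not from the thread or the paper) of one way refuting can be easier than lying: if a dishonest debater defends a long computation whose final answer is wrong, the honest debater can bisect the disagreement down to a single step that even a weak judge can check directly.

```python
def locate_wrong_step(numbers, claimed_totals):
    """Return the index of one inconsistent step in a claimed running sum.

    `claimed_totals[i]` is the dishonest debater's claimed value of
    sum(numbers[:i + 1]); we assume the final claimed total is wrong.
    The refuter never presents the whole correct computation, it only
    points at one step for the judge to check.
    """
    lo, hi = 0, len(numbers) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if sum(numbers[: mid + 1]) != claimed_totals[mid]:
            hi = mid      # the first wrong claim is at or before mid
        else:
            lo = mid + 1  # claims up to mid still match the truth
    return lo


numbers = [3, 1, 4, 1, 5]
claimed = [3, 4, 8, 10, 15]  # the lie is introduced at index 3
i = locate_wrong_step(numbers, claimed)
prev = claimed[i - 1] if i > 0 else 0
assert prev + numbers[i] != claimed[i]  # one cheap check exposes the lie
```

The asymmetry the sketch is gesturing at: the liar has to keep every intermediate claim consistent, while the refuter only has to find one step where it isn’t.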
Thanks for taking the time. I now understand all of your arguments and am convinced that most of my original criticisms are wrong or inapplicable. This has greatly increased my understanding of and confidence in AI safety via debate. Thank you for that. I updated the post accordingly. Here are the updated versions (copied from above):
Re complexity:
Update 2: I misunderstood Rohin’s response. He actually argues that, in cases where a claim X breaks down into claims X1 and X2, the debater has to choose which one is more effective to attack, i.e. it is not able to backtrack later on (maybe it still can by making the tree larger; not sure). Thus, my original claim about complexity is not a problem, since the debate will always be a linear path through a potentially exponentially large tree.
Re deception:
Update 2: We were able to agree on the bottleneck. We both believe that the claim “it is harder to lie than to refute a lie” is what determines whether debate works or not. Rohin was able to convince me that it is easier to refute a lie than I originally thought, and I therefore believe more in the merits of AI safety via debate. The main intuition that changed is that the refuter mostly has to keep poking holes rather than presenting an alternative in one step. In the “flawed physics” setting described above, for example, the opponent doesn’t have to explain the alternative physics in the first step. They could just continue to point out flaws and inconsistencies in the current setting and then gradually introduce the new system of physics and show how it resolves those inconsistencies.
Re final conclusion:
Update 2: Rohin mostly convinced me that my remaining criticisms don’t hold or are weaker than I thought. I now believe that the only real problem with debate (in a setting with well-intentioned verifiers) arises when the claim “it is harder to lie than to refute a lie” doesn’t hold. However, I have updated towards believing that refuting a lie is often much easier than I anticipated, because it only entails poking a sufficiently large hole in the claim and doesn’t necessitate presenting an alternative solution.
Thanks for making updates!
Excellent, I’m glad we’ve converged!