Thanks for taking the time. I now understand all of your arguments and am convinced that most of my original criticisms are wrong or inapplicable. This has greatly increased my understanding of and confidence in AI safety via debate. Thank you for that. I updated the post accordingly. Here are the updated versions (copied from above):
Re complexity: Update 2: I misunderstood Rohin’s response. He actually argues that, in cases where a claim X breaks down into claims X1 and X2, the debater has to choose which one is more effective to attack, i.e. it is not able to backtrack later on (maybe it still can by making the tree larger—not sure). Thus, my original claim about complexity is not a problem since the debate will always be a linear path through a potentially exponentially large tree.
Re deception: Update2: We were able to agree on the bottleneck. We both believe that the claim “it is harder to lie than to refute a lie” is the question that determines whether debate works or not. Rohin was able to convince me that it is easier to refute a lie than I originally thought and I, therefore, believe more in the merits of AI safety via debate. The main intuition that changed is that the refuter mostly has to continue poking holes rather than presenting an alternative in one step. In the “flawed physics” setting described above, for example, the opponent doesn’t have to explain the alternative physics setting in the first step. They could just continue to point out flaws and inconsistencies with the current setting and then slowly introduce the new system of physics and how it would solve these inconsistencies.
Re final conclusion: Update2: Rohin mostly convinced me that my remaining criticisms don’t hold or are less strong than I thought. I now believe that the only real problem with debate (in a setting with well-intentioned verifiers) is when the claim “it is harder to lie than to refute a lie” doesn’t hold. However, I updated that it is often much easier to refute a lie than I anticipated because refuting the lie only entails poking a sufficiently large hole into the claim and doesn’t necessitate presenting an alternative solution.
Thanks for taking the time. I now understand all of your arguments and am convinced that most of my original criticisms are wrong or inapplicable. This has greatly increased my understanding of and confidence in AI safety via debate. Thank you for that. I updated the post accordingly. Here are the updated versions (copied from above):
Re complexity:
Update 2: I misunderstood Rohin’s response. He actually argues that, in cases where a claim X breaks down into claims X1 and X2, the debater has to choose which one is more effective to attack, i.e. it is not able to backtrack later on (maybe it still can by making the tree larger—not sure). Thus, my original claim about complexity is not a problem since the debate will always be a linear path through a potentially exponentially large tree.
Re deception:
Update2: We were able to agree on the bottleneck. We both believe that the claim “it is harder to lie than to refute a lie” is the question that determines whether debate works or not. Rohin was able to convince me that it is easier to refute a lie than I originally thought and I, therefore, believe more in the merits of AI safety via debate. The main intuition that changed is that the refuter mostly has to continue poking holes rather than presenting an alternative in one step. In the “flawed physics” setting described above, for example, the opponent doesn’t have to explain the alternative physics setting in the first step. They could just continue to point out flaws and inconsistencies with the current setting and then slowly introduce the new system of physics and how it would solve these inconsistencies.
Re final conclusion:
Update2: Rohin mostly convinced me that my remaining criticisms don’t hold or are less strong than I thought. I now believe that the only real problem with debate (in a setting with well-intentioned verifiers) is when the claim “it is harder to lie than to refute a lie” doesn’t hold. However, I updated that it is often much easier to refute a lie than I anticipated because refuting the lie only entails poking a sufficiently large hole into the claim and doesn’t necessitate presenting an alternative solution.
Excellent, I’m glad we’ve converged!