Thanks for your detailed comment. Let me ask a few clarifying questions; I will update the post afterward.
Assumption I:
I understand where you are going but the underlying path in the tree might still be very long, right? The not-Fortnite-debater might argue that you couldn’t have played Fortnite because electricity doesn’t exist. Then the Fortnite-debater has to argue that it does exist, right?
Furthermore, I don’t see why it should just be one path in the tree. Some arguments have multiple necessary conditions/burdens. Why do I not have to prove all of them? Otherwise, the opponent in the debate can always answer with “OK assume everything you said is true, what about the other burden?”.
I’ll update this section once I understand your criticism better.
Assumption II:
Ok, let’s say that we are able to understand it after a billion years of looking at it. Or maybe we understand it after the heat death of the universe. Does that really change anything? Perhaps I should reframe it as “understanding the concept in principle (in a relevant sense)” or something like that.
The more compelling analogy to me is “could you teach your dog quantum physics”, given lots of time and resources. I’m not sure the dog is able to understand. What do you think?
Assumptions III and IV:
These are practical problems of debate. I mostly wanted to point out that they could happen to the people running experiments with debate. I think they could also happen in a company, e.g. when the AI says things that are in the interest of the specific verifier but not their manager. I think this point can be summarized as “as long as humans are the verifiers, human flaws can be breaking points of AI safety via debate”. I’ll rephrase them to emphasize this more.
Framing: what is AI safety via debate used for
I think your framing of debate as a tool for AI safety researchers reduces some of the problems I described, and I will rewrite the relevant passages. However, while the interests of the AI company might be less complex, they are not necessarily straightforward, e.g. when leadership has different ideals than the safety team and would thus verify different players in the final node.
Assumption V:
I agree with you that in a perfect setting this could not happen. However, we see it happen often in real life, e.g. in TV debates, or with well-intentioned scientists who have held wrong beliefs for a long time even though they were sometimes confronted with the truth and an explanation for it. I think it’s more a question of how much we trust the verifier to make the right call given a good explanation than a fundamental disagreement with the method.
Assumption VI:
The example is not optimal. I see that now and will change it. However, the underlying argument still seems true to me. The incentive of the AI is to get the human to declare it the winner, right? Therefore, it will use all the tools at its disposal to win. If it has superhuman intelligence and a very accurate model of the verifier(s), it will say things that make the humans give it the win. If part of that strategy is to be deceptive, why wouldn’t it use that? I think this is very important and I currently don’t understand your reasoning. Let me know if you think I’m missing something.
I understand where you are going but the underlying path in the tree might still be very long, right? The not-Fortnite-debater might argue that you couldn’t have played Fortnite because electricity doesn’t exist. Then the Fortnite-debater has to argue that it does exist, right?
Yes. It doesn’t seem like this has to be that long, since you break down the claim into multiple subclaims and only recurse down into one of the subclaims. Again, the 1800-person doesn’t have to be shown the full reasoning justifying the existence of electricity, they just have to observe that the opponent debater was unable to poke a hole in the “electricity exists” claim.
Otherwise, the opponent in the debate can always answer with “OK assume everything you said is true, what about the other burden?”.
If the opponent previously claimed that X, and then the debater showed actually not-X, and the opponent says “okay, sure, not-X, but what about Y”, they just immediately lose the debate. That is, you tell your human judges that in such cases they should award the victory to the debater that said X. The debater can say “you’re moving the goalposts” to make it really obvious to the judge.
Ok, let’s say that we are able to understand it after a billion years of looking at it. Or maybe we understand it after the heat death of the universe. Does that really change anything?
Yes! It means that there probably exists an exponential-sized tree that produces the right answer, and so debate could plausibly recreate the answer that that reasoning would come to!
(I think it is first worth understanding how debate can produce the same answers as an exponential-sized tree. As a simple, clean example, debate in chess with arbitrarily intelligent players but a human judge leads to optimal play, even though if the human computed optimal play using the direct brute force approach it would be done only well after the heat death of the universe.)
(Also, Figure 1 in AI Safety Needs Social Scientists kinda gets at the “implicit tree”.)
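A minimal sketch of that intuition (my own illustration, not from the original discussion; the toy tree, its depth, and the scoring function are made up): with two optimal players, the single game that actually gets played lands on the minimax value of an exponentially large tree, even though the “judge” only ever scores the one final position on that path.

```python
# Toy model: the judge scores only one leaf, yet optimal play by both
# "debaters" reaches the minimax value of a tree with 2**DEPTH leaves.
import hashlib

DEPTH = 16       # the full game tree has 2**DEPTH leaves
BRANCH = 2

def leaf_value(path):
    """Stand-in for the judge scoring a final position, in [0, 1] for player MAX."""
    digest = hashlib.sha256(bytes(path)).digest()
    return digest[0] / 255.0

def minimax(path, maximizing):
    """Brute-force value of the subtree below `path` -- exponential work."""
    if len(path) == DEPTH:
        return leaf_value(path)
    children = [minimax(path + [move], not maximizing) for move in range(BRANCH)]
    return max(children) if maximizing else min(children)

def optimal_play():
    """Both (arbitrarily smart) players pick their best move at every step;
    only this single root-to-leaf path is ever shown to the judge."""
    path, maximizing = [], True
    while len(path) < DEPTH:
        sign = 1 if maximizing else -1
        best = max(range(BRANCH),
                   key=lambda move: sign * minimax(path + [move], not maximizing))
        path.append(best)
        maximizing = not maximizing
    return path

played = optimal_play()
print("value of the one position the judge checks:", leaf_value(played))
print("minimax value of the whole exponential tree:", minimax([], True))  # same number
```

The exponential work is done implicitly by the debaters; the judge’s job stays constant-sized, which is the sense in which debate can recreate the answer of the exponential tree.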
The more compelling analogy to me is “could you teach your dog quantum physics”, given lots of time and resources. I’m not sure the dog is able to understand. What do you think?
You might be able to teach your dog quantum physics; it seems plausible that in a billion years you could teach your dog how to communicate, use language, have compositional concepts, apply logic, etc, and then once you have those you can explain quantum physics the way you’d explain it to a human.
But I agree that debate with dog judges won’t work, because the current capabilities of dogs aren’t past the universality threshold. I agree that if humans don’t actually cross the relevant universality threshold, i.e. our AIs know some stuff that we can’t comprehend even with arbitrarily large amounts of time, then debate with human judges won’t work either.
when leadership has different ideals than the safety team and would thus verify different players in the final node
I’m not sure what you’re imagining here. Debater A tries to deceive you, debater B points that out, safety team wants to award the victory to B but leadership wants to give it to A? If leadership wants its AI systems to be deceiving them in order to destroy humanity, then technical work is not going to save you; get better leadership.
If you mean something like “the safety team would like AI systems that always tell the truth but leadership is okay with AI systems that exaggerate sometimes” I agree that could happen but I don’t see why I should care about that.
I think it’s more a question of how much we trust the verifier to make the right call given a good explanation than a fundamental disagreement with the method.
Sure, I agree this is an important thing to get right, and that it’s not obvious that we get it right (but I also think it is not obvious that we get it wrong).
If it has superhuman intelligence and a very accurate model of the verifier(s), it will say things that make the humans give it the win. If part of that strategy is to be deceptive, why wouldn’t it use that?
I totally agree that if it can win by being deceptive it will be incentivized to do so (and probably will do so).
Why believe that it can win by being deceptive? There’s an opposing debater of equal intelligence waiting to pounce on any evidence of deception!
(Tbc, I do think there are reasons to worry about deception but they’re very different, e.g. “the deception is implemented by looking for a factorization of RSA-2048 and defecting if one is found, and so the opponent debater can’t notice this until the defection actually happens”, which can be fixed-in-theory by giving debaters access to each other’s internals.)
Thank you for the detailed responses. You have convinced me on everything except for two questions. I have updated the text to reflect that. The two remaining questions are (copied from the text):
On complexity: There was a second disagreement about complexity. I argued that some debates actually break down into multiple necessary conditions, e.g. if you want to argue that you played Fortnite, you have to show that it is possible to play Fortnite and then that it is plausible that you played it. The pro-Fortnite debater has to show both claims while the anti-Fortnite debater has to defeat only one. Rohin argued that this is not the case, because every debate is ultimately only about the plausibility of the original statement independent of the number of subcomponents it logically breaks down to (or at least that’s how I understood him).
On deception: This is the only point where Rohin hasn’t convinced me yet. He argues that the debaters have no incentive to be deceptive, since the other debater is equally capable and has an incentive to point out this deception. I think this is true: as long as the reward for pointing out deception is bigger than the reward for alternative strategies, e.g. being deceptive yourself, you are incentivized to be truthful.
Let’s say, for example, that our conception of physics was fundamentally flawed and both debaters knew this. To win the debate, one (truthful) debater would have to argue that our current concept of physics is flawed and establish the alternative theory, while the other (deceptive) debater could argue within our current framework of physics and sound much more plausible to the humans. The truthful debater is only rewarded for their honesty if the human verifier waits long enough to understand the alternative physics explanation instead of giving the win to the deceptive debater. If the human verifier stops early, deception is rewarded, right? What am I missing?
In general, I feel like the question of whether the debater is truthful or not only depends on whether they would be rewarded for being so. However, I (currently) don’t see strong reasons for the debater to always be truthful. To me, the bottleneck seems to be which kinds of behavior humans intentionally or unintentionally reward during training, and I can imagine enough scenarios in which we accidentally reward dishonest or deceptive behavior.
Thanks for making updates!
Rohin argued that this is not the case, because every debate is ultimately only about the plausibility of the original statement independent of the number of subcomponents it logically breaks down to (or at least that’s how I understood him).
No, that’s not what I mean.
The idea with debate is that you can have justified belief in some claim X if you see one expert (the “proponent”) agree with claim X, and another equally capable expert (the “antagonist”) who is solely focused on defeating the first expert is unable to show a problem with claim X. The hope is that the antagonist fails in its task when X is true, and succeeds when X is false.
We only give the antagonist one try at showing a problem with claim X. If the support for the claim breaks down into two necessary subcomponents, the antagonist should choose the one that is most problematic; it doesn’t get to backtrack and talk about the other subcomponent.
This does mean that the judge would not be able to tell you why the other subcomponent is true, but the fact that the antagonist didn’t choose to talk about that subcomponent suggests that the human judge would find that subcomponent more trustworthy than the one the antagonist did choose to talk about.
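To make the proponent/antagonist protocol concrete, here is a toy sketch (my own, not the formalism from the debate paper; the Claim class, the helper names, and the example claims are all made up): a claim is supported by necessary subclaims, the antagonist picks the single subclaim it thinks it can defeat and recursion continues there, and the judge only ever checks the leaf at the bottom of that one path.

```python
# Toy model of the one-try antagonist: no backtracking, one linear path,
# yet the verdict tracks whether the whole AND-tree of subclaims holds.
from dataclasses import dataclass, field

@dataclass
class Claim:
    statement: str
    holds: bool = True                               # ground truth of a leaf claim
    subclaims: list = field(default_factory=list)    # necessary conditions, all must hold

def judge_checks_leaf(claim):
    """The human judge can only directly verify a leaf-level claim."""
    return claim.holds

def foreseen_value(claim):
    """What an arbitrarily capable debater can work out about a whole subtree."""
    if not claim.subclaims:
        return claim.holds
    return all(foreseen_value(sub) for sub in claim.subclaims)

def debate(claim):
    """One linear path through the claim tree: the antagonist gets a single try,
    picks the subclaim it expects to defeat, and the judge only checks the leaf."""
    if not claim.subclaims:
        return judge_checks_leaf(claim)              # proponent wins iff the leaf holds up
    target = next((sub for sub in claim.subclaims if not foreseen_value(sub)),
                  claim.subclaims[0])                # nothing to defeat -> attack anything
    return debate(target)

fortnite = Claim("I played Fortnite", subclaims=[
    Claim("It is possible to play Fortnite", subclaims=[
        Claim("Electricity exists"),
        Claim("Computers and networks exist"),
    ]),
    Claim("It is plausible that I actually played it"),
])

print(debate(fortnite))                  # True: whichever subclaim is attacked holds up
fortnite.subclaims[1].holds = False      # now one necessary condition actually fails
print(debate(fortnite))                  # False: the antagonist zooms in on that subclaim
```

The debate touches only one root-to-leaf path, yet the outcome tracks whether every necessary subclaim actually holds, which is why the multiple-burdens worry does not blow up the length of the debate.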
I feel like the question of whether the debater is truthful or not only depends on whether they would be rewarded for being so. However, I (currently) don’t see strong reasons for the debater to always be truthful.
I mean, the reason is “if the debater is not truthful, the opponent will point that out, and the debater will lose”. This in turn depends on the central claim in the debate paper:
Claim. In the debate game, it is harder to lie than to refute a lie.
In cases where this claim isn’t true, I agree debate won’t get you the truth. I agree that in the “flawed physics” example, if you have a short debate, then deception is incentivized.
As I mentioned in the previous comment, I do think deception is a problem that you would worry about, but it’s only in cases where it is easier to lie than to refute the lie. I think it is inaccurate to summarize this as “debate assumes that AI is not deceptive”; there’s a much more specific assumption which is “it is harder to lie than to refute a lie” (which is way more plausible-sounding to me at least than “assumes that AI is not deceptive”).
Thanks for taking the time. I now understand all of your arguments and am convinced that most of my original criticisms are wrong or inapplicable. This has greatly increased my understanding of and confidence in AI safety via debate. Thank you for that. I updated the post accordingly. Here are the updated versions (copied from above):
Re complexity:
Update 2: I misunderstood Rohin’s response. He actually argues that, in cases where a claim X breaks down into claims X1 and X2, the debater has to choose which one is more effective to attack, i.e. it is not able to backtrack later on (maybe it still can by making the tree larger; I’m not sure). Thus, my original claim about complexity is not a problem, since the debate will always be a linear path through a potentially exponentially large tree.
Re deception:
Update 2: We were able to agree on the bottleneck. We both believe that the claim “it is harder to lie than to refute a lie” is the question that determines whether debate works or not. Rohin was able to convince me that it is easier to refute a lie than I originally thought, and I therefore believe more in the merits of AI safety via debate. The main intuition that changed is that the refuter mostly has to continue poking holes rather than presenting an alternative in one step. In the “flawed physics” example described above, the opponent doesn’t have to explain the alternative theory of physics in the first step. They could just continue to point out flaws and inconsistencies with the current theory and then slowly introduce the new system of physics and how it would solve these inconsistencies.
Re final conclusion:
Update 2: Rohin mostly convinced me that my remaining criticisms don’t hold or are less strong than I thought. I now believe that the only real problem with debate (in a setting with well-intentioned verifiers) is when the claim “it is harder to lie than to refute a lie” doesn’t hold. However, I have updated toward it often being much easier to refute a lie than I anticipated, because refuting the lie only entails poking a sufficiently large hole in the claim and doesn’t necessitate presenting an alternative solution.
Excellent, I’m glad we’ve converged!