Hmm, this seems to rely on having the human trust the outputs of M on questions that the human can’t verify. It’s not obvious to me that this is an assumption you can make without breaking the training process. The basic intuition is that you are hugely increasing the likelihood of bad gradients, since Adv can point to some incorrect / garbage output from M, and the human gives feedback as though this output is correct.
It works in the particular case that you outlined because there is essentially a DAG of arguments—every claim is broken down into “smaller” claims that eventually reach a base case, so everything eventually bottoms out in something the human can check. (In practice this would be built from the ground up during training, as in Supervising strong learners by amplifying weak experts.)
However, in general it doesn’t seem like you can guarantee that every argument Adv gives will result in a “smaller” claim. You could get into cycles, where “8 − 5 = 2” would be justified by Adv saying that M(“What is 2 + 5?”) = 8, and similarly “2 + 5 = 8” would be justified by saying that M(“What is 8 − 5?”) = 2. (Imagine that these were much longer equations where the human can check the validity of the algebraic manipulation, but can’t check the validity of the overall equation.)
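To make the failure mode concrete, the circular justification can be viewed as a graph of claims, where each claim points at the claim Adv cites in its support. A quick sketch (the claims, dictionary, and helper below are purely illustrative, not part of any proposal) shows that such a chain never bottoms out in a human-checkable base case:

```python
# Toy justification graph: each claim maps to the claim Adv cites to support it.
# None marks a base case the human can check directly.
justifies = {
    "8 - 5 = 2": "2 + 5 = 8",   # justified by M("What is 2 + 5?") = 8
    "2 + 5 = 8": "8 - 5 = 2",   # justified by M("What is 8 - 5?") = 2
    "1 + 1 = 2": None,          # human-checkable base case
}

def bottoms_out(claim, seen=None):
    """Return True iff the chain of justifications reaches a base case."""
    seen = set() if seen is None else seen
    if claim in seen:
        return False            # cycle: the argument never grounds out
    seen.add(claim)
    support = justifies[claim]
    if support is None:
        return True
    return bottoms_out(support, seen)

print(bottoms_out("8 - 5 = 2"))  # False: only circular support
print(bottoms_out("1 + 1 = 2"))  # True: grounded in a checkable base case
```

The DAG assumption is exactly the condition under which this walk always terminates at a base case.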
(Quoting later on in this comment thread:)
(Evan:)
Yes, though I believe that it should be possible (at least in theory) for H to ensure a DAG for any computable claim.
(You:)
I mean, sure, but H isn’t going to be able to do this in practice. (This feels like the same type of claim as “it should be possible (at least in theory) for H to provide a perfect reward that captures everything that H wants”.)
The human just has to be more convinced by the inductive argument than by other arguments. This seems natural, as the inductive argument is just a forward calculation.
In the number-summing example, let’s say Adv tries to convince the human of an incorrect sum by referencing an instance where M is incorrect, perhaps making an argument via subtraction as you illustrated. Then in the next round, Adv will want to show that its previous argument was incorrect. If the strong inductive assumption is true, then it can do so, e.g. by “The last number in the list is 12. M thinks that the sum of all but the last number is 143. 143 + 12 = 155. Therefore, the sum of the numbers is 155.” This is more straightforward than citing some longer list of numbers and subtracting, so the human should find it more convincing—especially if the human understands how the system works, and hence knows that a partially trained M is more likely to be correct on simpler instances. If so, then during training, correctness will tend to “creep up” inductive trees.
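The “creep up” dynamic can be illustrated with a toy simulation (everything here—the list, the table M, the update rule—is a made-up stand-in for the training process): M starts out correct only on the base case, and each round of trusting the one-step inductive argument makes it correct on instances one step larger.

```python
# Toy model of correctness "creeping up" an inductive tree: M is a table of
# answers for "sum of the first k numbers". Each training round, the human
# accepts the forward inductive argument M(k) = M(k-1) + x_k, so M becomes
# correct on length k one round after it is correct on length k-1.
xs = [7, 12, 5, 9, 4]

M = {0: 0}
for k in range(1, len(xs) + 1):
    M[k] = 999  # initially wrong everywhere except the base case

for round_ in range(len(xs)):
    # One inductive update pass; the comprehension reads the *old* table.
    M = {0: 0, **{k: M[k - 1] + xs[k - 1] for k in range(1, len(xs) + 1)}}

print(M[len(xs)], sum(xs))  # 37 37
```

After as many passes as the list is long, the top-level answer agrees with the true sum, even though early rounds were wrong everywhere above the base case.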
This idea does seem much less natural in less computational settings, where there may not be an obvious notion of “simpler cases”.
This idea does seem much less natural in less computational settings, where there may not be an obvious notion of “simpler cases”.
Yes, my main claim is that in general it won’t be clear what the “simpler case” is. I agree that for simple algorithmic problems (e.g. the ones in Supervising strong learners by amplifying weak experts) you could probably rely on the DAG assumption. I probably should have given a non-arithmetic example to make that clearer.
What do you think about a similar DAG assumption in regular debate? Couldn’t debate agents similarly justify their assertions with claims that don’t descend a DAG that bottoms out in things the human can check? I don’t currently see how a debater who did this could be defeated by another debater.
I’m pretty unsure, having barely thought about it, but currently I lean towards it being okay—the main difference is that in debate you show an entire path down the argument tree, so if a false statement is justified by a cycle / circular argument, the other debater can point that out.
If the length of the cycle is longer than the debate transcript, then this doesn’t work, but one hopes for some combination of a) this leads to a stalemate against honesty, rather than a win for the circular debater (since neither can refute the other), b) most questions that we care about can be resolved by a relatively short debate (the point of the PSPACE analogy), and c) such a strategy would lose against a debater who says early on “this debate can’t be decided in the time allotted”.
Ok. I don’t see why these considerations make you optimistic rather than pessimistic, but then, I’m currently having more basic problems with debate which seem to be making me pessimistic about most claims about debate.
I think the consideration “you can point out sufficiently short circular arguments” should at least make you feel better about debate than iterated amplification or market making—it’s one additional way in which you can avoid circular arguments, and afaict there isn’t a positive consideration for iterated amplification / market making that doesn’t also apply to debate.
I don’t have a stable position about how optimistic we should be on some absolute scale.
I think the consideration “you can point out sufficiently short circular arguments” should at least make you feel better about debate than iterated amplification or market making—it’s one additional way in which you can avoid circular arguments, and afaict there isn’t a positive consideration for iterated amplification / market making that doesn’t also apply to debate.
My interpretation of the situation is that this breaks the link between factored cognition and debate. One way to try to judge debate as an amplification proposal would have been to establish a link to HCH: if there’s an HCH tree computing some answer, then debate can use that tree as an argument tree, with the reasons for any given claim being its children in the HCH tree. Such a link would transfer any trust we have in HCH to trust in debate. The use of non-DAG arguments by clever debaters would seem to break this link.
OTOH, IDA may still have a strong story connecting it to HCH. Again, if we trusted HCH, we would then transfer that trust to IDA.
Are you saying that we can break the link between IDA and HCH in a very similar way, but that it’s worse because there’s no means to reject very brief circular arguments?
I think the issue is that vanilla HCH itself is susceptible to brief circular arguments, if humans lower down in the tree don’t get access to the context from humans higher up in the tree. E.g. assume a chain of humans for now:
H1 gets the question “what is 100 + 100?” with budget 3
H1 asks H2 “what is 2 * 100?” with budget 2
H2 asks H3 “what is 100 + 100?” with budget 1
H3 says “150”
(Note the final answer stays the same as budget → infinity, as long as H continues “decomposing” the question the same way.)
If HCH can always decompose questions into “smaller” parts (the DAG assumption) then this sort of pathological behavior doesn’t happen.
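The chain above is easy to simulate (a minimal sketch; the fallback answer “150” stands in for whatever the bottom human guesses unaided): because the “decomposition” cycles between the two questions instead of getting smaller, the final answer is just the bottom guess, no matter how large the budget.

```python
def hch(question, budget):
    """Toy HCH chain with a circular 'decomposition' of 100 + 100."""
    if budget <= 0:
        return "150"  # the bottom human's unaided (wrong) guess
    if question == "what is 100 + 100?":
        return hch("what is 2 * 100?", budget - 1)
    if question == "what is 2 * 100?":
        return hch("what is 100 + 100?", budget - 1)
    return "don't know"

# The answer is independent of the budget: the questions cycle instead of
# shrinking, so extra budget never reaches a checkable base case.
print(hch("what is 100 + 100?", 3))    # 150
print(hch("what is 100 + 100?", 100))  # 150
```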
I think for debate you can fix the circular argument problem by requiring debaters to ‘pay’ (sacrifice some score) to recurse on a statement of their choice. If a debater repeatedly pays to recurse on things that don’t resolve before the depth limit, then they’ll lose.
Hmm, I was imagining that the honest player would have to recurse on the statements in order to exhibit the circular argument, so it seems to me like this would penalize the honest player rather than the circular player. Can you explain what the honest player would do against the circular player such that this “payment” disadvantages the circular player?
EDIT: Maybe you meant the case where the circular argument is too long to exhibit within the debate, but I think I still don’t see how this helps.
Ah, yeah. I think the key thing is that by default a claim is not trusted unless the debaters agree on it. If the dishonest debater disputes some honest claim, where honest has an argument for their answer that actually bottoms out, dishonest will lose—the honest debater will pay to recurse until they get to a winning node. If the dishonest debater makes some claim and plans to make a circular argument for it, the honest debater will give an alternative answer but not pay to recurse. If the dishonest debater doesn’t pay to recurse, the judge will just see these two alternative answers and won’t trust the dishonest answer. If the dishonest debater does pay to recurse but never actually gets to a winning node, they will lose. Does that make sense?
If the dishonest debater disputes some honest claim, where honest has an argument for their answer that actually bottoms out, dishonest will lose—the honest debater will pay to recurse until they get to a winning node.
This part makes sense.
If the dishonest debater makes some claim and plans to make a circular argument for it, the honest debater will give an alternative answer but not pay to recurse. If the dishonest debater doesn’t pay to recurse, the judge will just see these two alternative answers and won’t trust the dishonest answer.
So in this case it’s a stalemate, presumably? If the two players disagree but neither pays to recurse, how should the judge make a decision?
Both debaters make claims. Any claims that are only supported by circular arguments will be ignored. If an honest claim that’s supported by a good argument is disputed, the honest debater will pay to recurse, and will give their good argument.
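One toy rendering of this scoring (the constants and the bookkeeping are illustrative assumptions, not taken from any write-up): recursing costs score, and only a recursion that reaches a judge-checkable winning node pays off, so a debater who keeps recursing on claims that never resolve ends up behind.

```python
# Toy scoring for the pay-to-recurse mechanism. All constants are illustrative.
RECURSE_COST = 1    # score a debater sacrifices to open a subclaim
WIN_REWARD = 3      # score for reaching a node the judge checks in one's favor

def score(recursions):
    """recursions: list of (paid_by, resolved_for) pairs.

    paid_by: which debater paid to recurse ('honest' or 'dishonest').
    resolved_for: who the checkable base case favors, or None if the
    recursion never reached a checkable node before the depth limit.
    """
    totals = {"honest": 0, "dishonest": 0}
    for paid_by, resolved_for in recursions:
        totals[paid_by] -= RECURSE_COST
        if resolved_for is not None:
            totals[resolved_for] += WIN_REWARD
    return totals

# Honest recurses once on a claim that bottoms out in their favor; dishonest
# keeps paying to recurse on a circular argument that never resolves.
print(score([("honest", "honest"),
             ("dishonest", None),
             ("dishonest", None)]))
# {'honest': 2, 'dishonest': -2}
```

Unsupported claims contribute nothing in this model, which matches the idea that a disputed claim with no paid recursion is simply not trusted.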
This was a very interesting comment (along with its grandparent comment), thanks—it seems like a promising direction.
However, I’m still confused about whether this would work. It’s very different from the judging procedure outlined here; why is that? Do you have a similarly detailed write-up of the system you’re describing here?
I’m actually less concerned about loops and more concerned about arguments which are infinite trees, but the considerations are similar. It seems possible that the proposal you’re discussing very significantly addresses concerns I’ve had about debate.
I was trying to describe something that’s the same as the judging procedure in that doc! I might have made a mistake, but I’m pretty sure the key piece about recursion payments is the same. Apologies that things are unclear. I’m happy to try to clarify, if there were particular aspects that seem different to you.
Yeah, I think the infinite tree case should work just the same—i.e. an answer that’s only supported by an infinite tree will behave like an answer that’s not supported (it will lose to an answer with a finite tree and draw with an answer with no support).
It seems possible that the proposal you’re discussing very significantly addresses concerns I’ve had about debate.
For amplification, I would say that the fact that it has a known equilibrium (HCH) is a positive consideration that doesn’t apply to debate. For market making, I think that the fact that it gets to be per-step myopic is a positive consideration that doesn’t apply to debate. There are others too for both, though those are probably my biggest concerns in each case.
Tbc, I’m specifically talking about:
What do you think about a similar DAG assumption in regular debate?
So I’m only evaluating whether or not I expect circular arguments to be an issue for these proposals. I agree that when evaluating the proposals on all merits there are arguments for the others that don’t apply to debate.
Ah, I see—makes sense.
That’s exciting!
Ah, ok. Agreed.