I agree—I think “Optimizing for the output of a grader which evaluates plans” is more-or-less how human brains choose plans, and I don’t think it’s feasible to make an AGI that doesn’t do that.
But it sounds like this will be the topic of Alex’s next essay.
So I’m expecting to criticize Alex’s next essay by commenting on it along the lines of: “You think you just wrote an essay about something which is totally different from “Optimizing for the output of a grader which evaluates plans”, but I disagree; the thing you’re describing in this essay is in that category too.” But that’s just a guess; I will let Alex write the essay before I criticize it. :-P
IMO, what the brain does is a bit like classifier-guided diffusion: it has a generative model of plausible plans to do X, then mixes this prior with the gradients from some “does this plan actually accomplish X?” classifier.
This is not equivalent to finding a plan that maximizes the score of the “does this plan actually accomplish X?” classifier. If you were to discard the generative prior and choose your plan by argmaxing the classifier’s score, you’d get some nonsensical adversarial noise (or maybe some insane, but technically coherent plan, like “plan to make a plan to make a plan to … do X”).
It sounds like some people have an intuition that the mental algorithms “sample from a conditional generative model” and “search for the argmax / epsilon-close-to-argmax input to a scoring function” are effectively the same. I don’t share that intuition and struggle to communicate across that divide. Like, when I think about it through ML examples (GPT, diffusion models, etc.), those are two very different pieces of code that produce two very different kinds of outputs.
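For concreteness, here is a minimal toy sketch of that difference (the plan strings, prior weights, and classifier scores below are all made-up numbers, and the “classifier” is just a lookup table): sampling from the prior reweighted by the classifier mostly returns plausible plans, while argmaxing the classifier alone returns the adversarial entry.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "plan space": the last entry stands in for an adversarial/degenerate plan
# that fools the classifier but is wildly implausible under the generative prior.
plans = ["walk to the shop", "cycle to the shop", "drive to the shop",
         "plan to make a plan to make a plan to ... do X"]
prior = np.array([0.5, 0.3, 0.199, 0.001])   # generative model of plausible plans
score = np.array([2.0, 2.1, 1.9, 5.0])       # "does this plan accomplish X?" classifier

# (a) Guidance-style sampling: reweight the prior by exp(beta * score), then sample.
beta = 1.0
guided = prior * np.exp(beta * score)
guided /= guided.sum()
print("guided samples:", list(rng.choice(plans, size=10, p=guided)))

# (b) Pure argmax of the classifier, ignoring the prior entirely.
print("argmax plan:   ", plans[int(np.argmax(score))])
```

This is only a discrete stand-in for classifier-guided diffusion, which mixes classifier gradients into the sampler’s update steps, but it illustrates that the two procedures really are different pieces of code producing different kinds of outputs.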
I believe sampling from a conditional distribution is basically equivalent to adding a “cost of action” (where “action” = deviating from the generative model) to argmax search.
If you have time, I think it’d be valuable for you to make a case for that.
Suppose M is your prior distribution, u is your utility function, and you are selecting some policy distribution Π so as to maximize E(u|Π) − KL(Π||M). Here the first term represents the standard utility-maximization objective, whereas the second term represents a cost of action. This expands into ∫(u − log(Π/M)) dΠ, which is equivalent to minimizing ∫log(Π/(M·e^u)) dΠ, or in other words KL(Π || M·e^u), which happens when Π ∝ M·e^u. (I think; I’m rusty on this math, so I might have made a mistake.)
This is not 100% equivalent to letting Π be a Bayesian-conditioned version of M, because Bayesian conditioning multiplies M by an indicator function whereas this multiplies M by a strictly positive function; but it seems related and probably shares most of its properties.
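As a rough numerical sanity check of that derivation (not a proof), here is a sketch on a 5-element outcome space with randomly drawn M and u: the tilted distribution Π ∝ M·e^u should score at least as well on E(u|Π) − KL(Π||M) as any randomly drawn alternative distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 5
M = rng.dirichlet(np.ones(n))   # prior distribution M over n outcomes
u = rng.normal(size=n)          # utility function u

def objective(pi):
    """E(u|pi) - KL(pi || M)."""
    pi = np.clip(pi, 1e-12, None)
    return pi @ u - np.sum(pi * np.log(pi / M))

# Candidate from the derivation above: pi proportional to M * e^u.
pi_star = M * np.exp(u)
pi_star /= pi_star.sum()

# Compare against a large number of randomly drawn distributions.
best_random = max(objective(rng.dirichlet(np.ones(n))) for _ in range(100_000))
print("objective at M*e^u / Z:", objective(pi_star))
print("best random pi found :", best_random)   # should not exceed the line above
```

Note that the optimum here is itself a distribution (the exponentially tilted prior) which you would then sample from; it is not a point mass on the single highest-utility outcome.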
The two of us went back and forth in DMs on this for a bit. Based on that conversation, I think a mutually-agreeable translation of the above argument would be “sampling from [the conditional distribution of X-es given the Y label] is the same as sampling from [the distribution that has maximum joint [[closeness to the distribution of X-es] and [prevalence of Y-labeled X-es]]]”. Even if this isn’t exact, I buy that as at least morally true.
However, I don’t think this establishes the claim I’d been struggling with, which was that there’s some near equivalence between drawing a conditional sample and argmax searching over samples (possibly w/ some epsilon tolerance). The above argument establishes that we can view conditioning itself as the solution to a maximization problem over distributions, but not that we can view conditional sampling as the solution to any kind of maximization problem over samples.
I would also add that the key exciting things happen when you condition on an event with extremely low probability / have a utility function with an extremely wide range of available utilities. cfoster0’s view is that this will mostly just cause it to fail/output nonsense, because of standard arguments along the lines of the Optimizer’s Curse. I agree that this could happen, but I think it depends on the intelligence of the argmaxer/conditioner, and that another possibility (if we had more capable AI) is that this sort of optimization/conditioning could have a lot of robust effects on reality.
I can’t see a clear mistake in the math here, but it seems fairly straightforward to construct a counterexample to the equivalence the math naively points to.
Suppose we want to use GPT-3 to generate a 600-token essay praising some company X. Here are two ways we might do this:
1. Prompt GPT-3 to generate the essay, sample 5 continuations, and then use a sentiment classifier to select the completion with the most positive sentiment.
2. Prompt GPT-3 to generate the essay, then score every possible continuation by the classifier’s sentiment score plus λ times the continuation’s log probability under GPT-3 (the “cost of action” term for deviating from the prior), and take the argmax.
I expect that the first method will mostly give you reasonable results, assuming you use text-davinci-002. However, I think the second method will tend to give you extremely degenerate solutions such as “good good good good...” for 600 tokens.
One possible reason for this divide is that GPTs aren’t really a prior over language, but a prior over single-token continuations of a given natural-language context. When you try to make one act like a prior over an entire essay, you expose it to inputs that are very OOD relative to the distribution it’s calibrated to model, including inputs whose probabilities it significantly overestimates.
However, I think a “perfect” model of human language might actually assign higher prior probability to a continuation like “good good good...” (or maybe something like “X is good because X is good because X is good...”) than to a “natural” continuation, provided you made the continuations long enough. This is because the number of possible natural continuations is roughly exponential in the length of the continuation (assuming entropy per character remains ~constant), while there are far fewer possible degenerate continuations (their entropy decreases very quickly). While the probability of entering a degenerate continuation may be very low, you make up for it with the reduced branching factor.
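To see the two methods come apart without invoking GPT-3 itself, here is a toy stand-in (the vocabulary, transition probabilities, and λ are all invented): a tiny bigram “language model” over six words, with a crude sentiment score that just counts occurrences of “good”. Best-of-5 sampling typically returns an ordinary-looking sequence, while exhaustive argmax of sentiment + λ·logprob selects the all-“good” sequence, because the self-loop on “good” gives the degenerate sequence the highest probability of any single sequence, which is the reduced-branching-factor effect in miniature.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# A tiny hand-made bigram "language model" standing in for GPT-3.
vocab = ["good", "the", "company", "makes", "products", "fast"]
start = np.full(len(vocab), 1 / len(vocab))   # uniform distribution over first token
trans = np.array([                            # P(next | prev); each row sums to 1
    [0.50, 0.10, 0.10, 0.10, 0.10, 0.10],     # after "good"
    [0.10, 0.05, 0.45, 0.05, 0.30, 0.05],     # after "the"
    [0.10, 0.05, 0.05, 0.50, 0.10, 0.20],     # after "company"
    [0.20, 0.10, 0.05, 0.05, 0.50, 0.10],     # after "makes"
    [0.15, 0.30, 0.10, 0.15, 0.05, 0.25],     # after "products"
    [0.15, 0.30, 0.15, 0.15, 0.15, 0.10],     # after "fast"
])
log_start, log_trans = np.log(start), np.log(trans)
L = 6                                         # continuation length (tiny, so we can enumerate)

def logprob(seq):
    return log_start[seq[0]] + sum(log_trans[a, b] for a, b in zip(seq, seq[1:]))

def sentiment(seq):
    # Crude stand-in for a sentiment classifier: count the "good" tokens.
    return sum(1 for t in seq if vocab[t] == "good")

def sample_seq():
    seq = [rng.choice(len(vocab), p=start)]
    for _ in range(L - 1):
        seq.append(rng.choice(len(vocab), p=trans[seq[-1]]))
    return seq

# Method 1: sample 5 continuations, keep the one the classifier likes best.
best_sample = max((sample_seq() for _ in range(5)), key=sentiment)

# Method 2: exhaustive argmax of sentiment + lam * logprob over every continuation.
lam = 1.0
best_argmax = max(itertools.product(range(len(vocab)), repeat=L),
                  key=lambda s: sentiment(s) + lam * logprob(s))

print("method 1 (best of 5 samples):", " ".join(vocab[t] for t in best_sample))
print("method 2 (argmax search):    ", " ".join(vocab[t] for t in best_argmax))
```

Even though the degenerate sequence is the single most likely one under this toy model, sampling almost never produces it, so the best-of-n method stays on-distribution.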
The error is that the KL-divergence term doesn’t mean adding a per-continuation cost based on the log probability of the continuation. In fact, it isn’t expressible at all in terms of argmaxing over a single continuation; it requires argmaxing over a distribution of continuations.
(Haven’t double-checked the math or fully grokked the argument behind it, but strongly upvoted for making a case.)
I would be curious to know if it makes sense to anyone or if anyone agrees/disagrees.
Seems like you can always implement any function f: X → Y as a search process. For any input x from the domain X, just make the search objective assign 1 to f(x) and 0 to everything else, then argmax over this objective.
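A sketch of that construction (hypothetical helper name; the codomain is assumed finite so the argmax is computable):

```python
def as_search(f, codomain):
    """Implement f: X -> Y as argmax over an objective that scores
    the "right answer" f(x) with 1 and everything else with 0."""
    candidates = list(codomain)
    def searched(x):
        target = f(x)
        return max(candidates, key=lambda y: 1 if y == target else 0)
    return searched

# Example: the squaring function on small ints, "implemented" as a search process.
square_by_search = as_search(lambda n: n * n, codomain=range(100))
print(square_by_search(7))  # 49
```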
Yes but my point uses a different approach to the translation, and so it seems like my point allows various standard arguments about argmax to also infect conditioning, whereas your proposed equivalence doesn’t really provide any way for standard argmax arguments to transfer.
Wouldn’t that be “Optimizing for the output of a grader which evaluates plans”, where one of the items on the grading rubric is “This plan is in-distribution”?
Maybe if you have a good measure of being in-distribution, which itself is a nontrivial problem.
This sounds like a reinvention of quantilization, and yes that’s a thing you can do to improve safety, but 1. you still need your prior over plans to come from somewhere (perhaps you start out with something IRL-like, and then update it based on experience of what worked, which brings you back to square one), 2. it just gives you a safety-capabilities tradeoff dial rather than particularly solving safety.
Or hmm...
If you do basic reinforcement based on experience, then that’s an unbounded adversarial search, but it’s really slow and therefore might be safe. And it also raises the question of whether there are other safer approaches.
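For reference, here is a minimal sketch of the quantilization idea mentioned above (the function names and the toy prior/utility are illustrative): draw plans from a base prior, then pick uniformly from the top q fraction by utility. Setting q near 1 recovers plain sampling from the prior, while q → 0 approaches argmax over the sampled plans, which is the safety-capabilities dial.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantilize(sample_plan, utility, q=0.1, n=1000):
    """Minimal q-quantilizer sketch: draw n plans from the base prior,
    then pick uniformly at random from the top q fraction by utility."""
    plans = sorted((sample_plan() for _ in range(n)), key=utility, reverse=True)
    k = max(1, int(q * n))
    return plans[rng.integers(k)]

# Toy 1-D "plan space": base prior is a standard normal; the utility has a
# narrow adversarial spike at x = 5 that pure argmax over all plans would find,
# but which the prior essentially never proposes.
sample_plan = lambda: rng.normal()
utility = lambda x: -abs(x - 1) + (100 if abs(x - 5) < 1e-3 else 0)
print(quantilize(sample_plan, utility, q=0.05))   # typically a plan near x = 1
```

Because the adversarial spike is essentially never proposed by the prior, the quantilizer almost never selects it, unlike an unconstrained argmax over the whole plan space.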
See my comment to Wei Dai. Argmax’s violation of the non-adversarial principle strongly suggests the existence of a better and more natural frame on the problem.
I think “Optimizing for the output of a grader which evaluates plans” is more-or-less how human brains choose plans

I deeply disagree. I think you might be conflating the quotation and the referent, two different patterns:
1. Local semi-reflective search, which is what I think people do. Sample inner monologue: “does it make sense to spend another hour thinking of alternatives? [self-model says ‘yes’] OK, I will”, or “do I predict ‘search for plans which would most persuade me of their virtue’ to actually lead to virtuous plans? [Self-model says ‘no’] Moving on...”
2. Global search against the output of an evaluation function implemented within the brain (grader-optimization): “what kinds of plans would my brain like the most?”
It is possible to say sentences like “local semi-reflective search just is global search but with implicit constraints like ‘select for plans which your self-model likes’.” I don’t think this is true. I am going to posit that, as a matter of falsifiable physical fact, the human brain does not compute a predicate which, when checked against all possible plans, rules out all adversarial plans, such that you can just argmax over everything and get out what the person would have chosen/would have wanted to choose on reflection. If you argmax over human value shards relative to the plans they might grade, you’ll probably get some garbage plan where you’re, like, twitching on the floor.
I don’t think it’s feasible to make an AGI that doesn’t do that.

You’ll notice that A shot at the diamond alignment problem makes no claim of the AGI having an internal-argmax-hardened diamond value shard.