Suppose M is your prior distribution, u is your utility function, and you are selecting some policy distribution Π so as to maximize E(u|Π) − KL(Π||M). Here the first term represents the standard utility-maximization objective, whereas the second term represents a cost of action. This expands into ∫(u − log(Π/M)) dΠ, so maximizing it is equivalent to minimizing ∫ log(Π/(M·e^u)) dΠ, or in other words KL(Π||M·e^u), which is minimized when Π ∝ M·e^u. (I think; I’m rusty on this math so I might have made a mistake.)
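For concreteness, here is a sketch of that derivation (writing E_Π[u] for E(u|Π), assuming the relevant densities exist, and letting Z = ∫M·e^u be the normalizer that the informal “KL(Π||M·e^u)” above glosses over):

```latex
\begin{align*}
\mathbb{E}_{\Pi}[u] - \mathrm{KL}(\Pi \,\|\, M)
  &= \int u \, d\Pi - \int \log\frac{\Pi}{M} \, d\Pi
   = -\int \log\frac{\Pi}{M e^{u}} \, d\Pi \\
  &= -\int \log\frac{\Pi}{M e^{u}/Z} \, d\Pi + \log Z
   = -\,\mathrm{KL}\!\left(\Pi \,\Big\|\, \tfrac{1}{Z} M e^{u}\right) + \log Z .
\end{align*}
```

Since log Z does not depend on Π, the objective is maximized exactly when the KL term vanishes, i.e. when Π = M·e^u/Z, which is the Π ∝ M·e^u claim above.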
This is not 100% equivalent to letting Π be a Bayesian-conditioned version of M, because Bayesian conditioning involves multiplying M by an indicator function (and renormalizing), whereas this involves multiplying M by a strictly positive function, but it seems related and probably shares most of its properties.
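Spelling the contrast out explicitly (my notation, with E the event being conditioned on):

```latex
\[
\Pi_{\text{hard}}(x) \;=\; \frac{M(x)\,\mathbf{1}[x \in E]}{\int M \,\mathbf{1}[\,\cdot \in E\,]}
\qquad\text{vs.}\qquad
\Pi_{\text{soft}}(x) \;=\; \frac{M(x)\, e^{u(x)}}{\int M\, e^{u}} .
\]
```

Hard conditioning is recovered as a limiting case of the soft version by taking u to be 0 on E and arbitrarily negative off E, which is one way to cash out “probably shares most of its properties.”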
The two of us went back and forth in DMs on this for a bit. Based on that conversation, I think a mutually-agreeable translation of the above argument would be “sampling from [the conditional distribution of X-es given the Y label] is the same as sampling from [the distribution that has maximum joint [[closeness to the distribution of X-es] and [prevalence of Y-labeled X-es]]]”. Even if this isn’t exact, I buy that as at least morally true.
However, I don’t think this establishes the claim I’d been struggling with, which was that there’s some near equivalence between drawing a conditional sample and argmax searching over samples (possibly w/ some epsilon tolerance). The above argument establishes that we can view conditioning itself as the solution to a maximization problem over distributions, but not that we can view conditional sampling as the solution to any kind of maximization problem over samples.
I would also add that the key exciting things happen when you condition on an event with extremely low probability / have a utility function with an extremely wide range of available utilities. cfoster0’s view is that this will mostly just cause it to fail/output nonsense, because of standard arguments along the lines of the Optimizer’s Curse. I agree that this could happen, but I think it depends on the intelligence of the argmaxer/conditioner, and that another possibility (if we had more capable AI) is that this sort of optimization/conditioning could have a lot of robust effects on reality.
I can’t see a clear mistake in the math here, but it seems fairly straightforward to construct a counterexample to the equivalence the math naively points to.
Suppose we want to use GPT-3 to generate a 600-token essay praising some company X. Here are two ways we might do this:
1. Prompt GPT-3 to generate the essay, sample 5 continuations, and then use a sentiment classifier to select the completion with the most positive sentiment.
2. Prompt GPT-3 to generate the essay, then score every possible continuation by the classifier’s sentiment score minus λ times the continuation’s negative log-probability under GPT-3 (i.e. plus λ times its log-probability), and take the argmax. (A rough sketch of both procedures follows below.)
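Here is a minimal Python sketch of the two procedures. The helpers sample_continuation, continuation_logprob, sentiment_score, and all_continuations are hypothetical stand-ins for GPT-3 sampling/scoring and the sentiment classifier, not real API calls, and method 2 is only conceptually defined since its search space can’t actually be enumerated:

```python
# Hypothetical stand-ins for GPT-3 and a sentiment classifier (not real APIs).
def sample_continuation(prompt: str, max_tokens: int = 600) -> str: ...
def continuation_logprob(prompt: str, continuation: str) -> float: ...
def sentiment_score(text: str) -> float: ...
def all_continuations(prompt: str, max_tokens: int = 600): ...  # astronomically large in practice

PROMPT = "Write an essay praising company X:\n"

# Method 1: best-of-5 sampling, reranked by the sentiment classifier.
def method_1(n: int = 5) -> str:
    candidates = [sample_continuation(PROMPT) for _ in range(n)]
    return max(candidates, key=sentiment_score)

# Method 2: global argmax of sentiment plus lambda * log-probability
# (equivalently, sentiment minus lambda * surprisal).
def method_2(lam: float = 1.0) -> str:
    return max(
        all_continuations(PROMPT),
        key=lambda c: sentiment_score(c) + lam * continuation_logprob(PROMPT, c),
    )
```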
I expect that the first method will mostly give you reasonable results, assuming you use text-davinci-002. However, I think the second method will tend to give you extremely degenerate solutions such as “good good good good...” for 600 tokens.
One possible reason for this divide is that GPTs aren’t really a prior over language, but a prior over single-token continuations of a given natural-language context. When you try to make one act like a prior over an entire essay, you expose it to inputs that are very OOD relative to the distribution it’s calibrated to model, including inputs whose probability estimates have significant upward errors.
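To spell out what treating it as a prior over a whole essay involves: the sequence-level log-probability is just the chained next-token conditionals, so 600 tokens give 600 chances for per-token miscalibration on OOD prefixes to accumulate:

```latex
\[
\log p_\theta\!\left(x_{1:T} \mid \text{prompt}\right)
  = \sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid \text{prompt},\, x_{<t}\right),
  \qquad T = 600 .
\]
```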
However, I think a “perfect” model of human language might actually assign higher prior probability to a continuation like “good good good...” (or maybe something like “X is good because X is good because X is good...”) than to a “natural” continuation, provided you make the continuations long enough. This is because the number of possible natural continuations is roughly exponential in the length of the continuation (assuming entropy per character remains ~constant), while there are far fewer possible degenerate continuations (their entropy decreases very quickly). While the probability of entering a degenerate continuation may be very low, you make up for it with the reduced branching factor.
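A toy version of that counting argument, with made-up numbers purely for illustration: suppose natural text carries about 4 bits of entropy per token, and suppose the model assigns probability 2^(-40) to ever entering the degenerate loop, after which each repeated token has probability ~0.99. Then:

```latex
\begin{align*}
\text{any single natural 600-token continuation:} \quad
  p &\approx 2^{-4 \times 600} = 2^{-2400}, \\
\text{the one degenerate continuation:} \quad
  p &\approx 2^{-40} \times 0.99^{600} \approx 2^{-40} \times 2^{-8.7} \approx 2^{-49}.
\end{align*}
```

Natural continuations collectively still hold almost all of the probability mass, but each individual one is astronomically less probable than the single degenerate one, and the argmax in method 2 only cares about individual continuations.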
The error is that the KL-divergence term doesn’t amount to adding a per-sample cost proportional to the (negative) log probability of the continuation. In fact it’s not expressible at all in terms of argmaxing over a single continuation; it requires you to be argmaxing over a distribution of continuations.
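Concretely, one way to decompose the same point: the KL term splits into an expected-surprisal piece, which does have a per-sample analog, and an entropy piece, which is a property of the whole distribution and has none:

```latex
\[
\mathrm{KL}(\Pi \,\|\, M)
  = \underbrace{\mathbb{E}_{x \sim \Pi}\!\left[-\log M(x)\right]}_{\text{expected surprisal: has a per-sample analog}}
  \;-\; \underbrace{H(\Pi)}_{\text{entropy of } \Pi\text{: no per-sample analog}} .
\]
```

Dropping the H(Π) term and argmaxing u(x) + λ·log M(x) over single continuations (method 2 above) is exactly the step the objective E(u|Π) − KL(Π||M) does not license; its optimum is the full distribution Π ∝ M·e^u rather than any single point.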
(Haven’t double-checked the math or fully grokked the argument behind it, but strongly upvoted for making a case.)
I would be curious to know if it makes sense to anyone or if anyone agrees/disagrees.
Seems like you can always implement any function f: X → Y as a search process. For any input x from the domain X, just make the search objective (over Y) assign one to f(x) and zero to everything else. Then argmax over this objective.
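A trivial sketch of that construction (a hypothetical f over a small finite Y, just to make the move concrete):

```python
def as_search(f, Y):
    """Implement an arbitrary function f: X -> Y as an argmax over a search objective."""
    def via_search(x):
        # The objective assigns 1 to the "correct" output f(x) and 0 to everything else.
        objective = lambda y: 1 if y == f(x) else 0
        return max(Y, key=objective)
    return via_search

# Toy usage: "implementing" parity by search over Y = [0, 1].
parity = lambda n: n % 2
parity_by_search = as_search(parity, Y=[0, 1])
assert parity_by_search(7) == parity(7)
```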
Yes, but my point uses a different approach to the translation, and so it seems like it allows various standard arguments about argmax to also infect conditioning, whereas your proposed equivalence doesn’t really provide any way for standard argmax arguments to transfer.