TurnTrout comments on Don’t design agents which exploit adversarial inputs

TurnTrout 21 Nov 2022 20:16 UTC
LW: 10 AF: 6
8
if you design an AI that doesn’t argmax (or do its best to approximate argmax), and someone else builds one that does, your AI will lose a fair fight (e.g., economic competition starting from equal capabilities and resource endowments), so it would be nice if alignment doesn’t mean giving up argmax.
This seems exactly backwards to me. Argmax violates the non-adversarial principle and wastes computation. Argmax requires you to spend effort hardening your own utility function against the effort you’re also expending searching across all possible inputs to your utility function (including the adversarial inputs!). For example, if I argmaxed over my own plan-evaluations, I’d have to consider the most terrifying-to-me basilisks possible, and rate none of them unusually highly. I’d have to spend effort hardening my own ability to evaluate plans, in order to safely consider those possibilities.
It would be far wiser to not consider all possible plans, and instead close off large parts of the search space. You can consider what plans to think about next, and how long to think, and so on. And then you aren’t argmaxing. You’re using resources effectively.
For example, some infohazardous thoughts exist (like hyper-optimized-against-you basilisks) which are dangerous to think about (although most thoughts are probably safe). But an agent which plans its next increment of planning using a reflective self-model is IMO not going to be like “hey it would be predicted-great if I spent the next increment of time thinking about an entity which is trying to manipulate me.” So, e.g., a reflective agent trying to actually win with the available resources wouldn’t do something dumb like “run argmax” or “find the plan which some part of me evaluates most highly.”
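As a rough illustration of the difference in exposure (a toy sketch, not anything from the original post): below, a hypothetical `grader` is accurate on almost every plan but wildly wrong on a handful of inputs. Argmax over the whole plan space lands on one of those inputs, while a procedure that only ever considers a few nearby plans never evaluates them. The plan space, the function names, and all numbers are assumptions chosen for illustration. The local search is a crude stand-in, not a model of reflective planning; it only shows what changes when most of the space is never searched.

```python
from typing import Callable, Iterable

# Hypothetical toy setup: plans are integers, and the agent "really" wants
# plans close to 4_200. None of these names come from the original post.
PLAN_SPACE = range(10_000)


def true_value(plan: int) -> float:
    """What the agent actually cares about (higher is better, peak at 4_200)."""
    return -abs(plan - 4_200) / 1_000


def is_adversarial(plan: int) -> bool:
    """A few rare inputs on which the grader is badly wrong (7, 2510, 5013, 7516)."""
    return plan % 2_503 == 7


def grader(plan: int) -> float:
    """Imperfect plan evaluation: correct almost everywhere, spuriously huge on rare inputs."""
    if is_adversarial(plan):
        return 100.0
    return true_value(plan)


def argmax(plans: Iterable[int], score: Callable[[int], float]) -> int:
    """Exhaustive search: return the plan with the highest score."""
    return max(plans, key=score)


def local_search(start: int, steps: int, stride: int = 100) -> int:
    """Consider only three nearby plans per step; most of the space is never evaluated."""
    current = start
    for _ in range(steps):
        candidates = [current - stride, current, current + stride]
        candidates = [p for p in candidates if p in PLAN_SPACE]
        current = max(candidates, key=grader)
    return current


exhaustive_pick = argmax(PLAN_SPACE, grader)
local_pick = local_search(start=4_000, steps=10)

print("exhaustive:", exhaustive_pick, is_adversarial(exhaustive_pick), true_value(exhaustive_pick))
print("local:     ", local_pick, is_adversarial(local_pick), true_value(local_pick))
```

In this toy setup the exhaustive search is guaranteed to find a grader error if one exists anywhere, because finding the highest grade is exactly what it is built to do. The local search evaluates at most 30 plans, so its exposure depends on where it chooses to look.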
(See Charles Foster’s comment for another perspective here.)
If I was doing the evaluation, I wouldn’t look at the plan directly but spend the first 4999 years slowly and carefully upgrading myself and my AI helpers, and then if I’m still not sure I can safely evaluate a plan, I would just throw an exception or return an error code instead of looking at the plan.
Unless this grader procedure implements a perfectly robust mathematical (plan input)->(grade output) function, you get hacked.
What links here?
- Don’t align agents to evaluations of plans by TurnTrout (26 Nov 2022 21:16 UTC; 45 points)
- TurnTrout's comment on Don’t design agents which exploit adversarial inputs by TurnTrout (21 Nov 2022 20:57 UTC; 3 points)
- Wei Dai 22 Nov 2022 1:42 UTC
  LW: 3 AF: 3
  0
  
  It would be far wiser to not consider all possible plans, and instead close off large parts of the search space. You can consider what plans to think about next, and how long to think, and so on. And then you aren’t argmaxing. You’re using resources effectively.
  
  But aren’t you still argmaxing within the space of plans that you haven’t closed off (or are actively considering), and still taking a risk of finding some adversarial plan within that space? (Humans get scammed and invent or fall into cults and crazy ideologies not infrequently, despite doing what you’re describing here already.) How do you just “not argmax” or “not design agents which exploit adversarial inputs”?
  
  Maybe there’s no substantive disagreement here, merely an issue of presentation/communication? I.e., when you say “you aren’t argmaxing” perhaps you don’t mean “don’t ever use argmax anywhere” but instead “don’t argmax over the whole plan space” and by “don’t design agents which exploit adversarial inputs” you mean something like “we should try to find ways to avoid or reduce the risk of adversarial inputs”?
  - TurnTrout 24 Nov 2022 1:06 UTC
    LW: 2 AF: 2
    0
    I.e., when you say “you aren’t argmaxing” perhaps you don’t mean “don’t ever use argmax anywhere” but instead “don’t argmax over the whole plan space”
    I was primarily critiquing “argmax over the whole plan space.” I do caution that I think it’s extremely important to not round off “iterative, reflective planning and reasoning” as “restricted argmax”, because that obscures the dynamics and results of real-world cognition. Argmax is also a bad model of what people are doing when they think, and how I expect realistic embedded agents to think.
    “don’t design agents which exploit adversarial inputs” you mean something like “we should try to find ways to avoid or reduce the risk of adversarial inputs”
    No, I mean: don’t design agents which are motivated to find and exploit adversarial inputs. Don’t align an agent to evaluations which are only nominally about diamonds, and then expect the agent to care about diamonds! You wouldn’t align an agent to care about cows and then be surprised that it didn’t care about diamonds. Why be surprised here?
    - TurnTrout 24 Nov 2022 1:06 UTC
      LW: 5 AF: 3
      2
      I wrote a bunch more before realizing that we maybe don’t disagree fully on the “don’t argmax” point. Here:
      But aren’t you still argmaxing within the space of plans that you haven’t closed off (or are actively considering),
      Not really? I think it is inappropriately suggestive to describe this as “argmaxing.” I, for one, usually feel like I consider at most three plans during most planning sessions. Most of the work is going to be in my generative models, in my learned habits of thought, in my snap reflective assessments of what I should think about next.
      How many different plans do you consider for going to the store? For writing a LessWrong post? Even if you did consider more plans, you’d convergently want to explore parts of the plan-space which you think won’t contain secret adversarial examples to your own evaluations. E.g., at first pass, just don’t think about entities trying to acausally blackmail you.
      Argmax is an abstraction which may or may not actually describe a given cognitive process. I think that if we label reflective incremental planning and reasoning as “argmax”, we’re missing a serious opportunity for original thought, for considering in detail what the algorithm does.
      and still taking a risk of finding some adversarial plan within that space?
      There is indeed a risk you’ll find an adversarial plan. But what is the risk, quantitatively? A reflective agent will convergently wish to avoid thinking about plans which exploit its own evaluation procedures and reasoning (e.g., tricking the diamond-shard into bidding for plans). In stark contrast, grader-optimizers and argmaxers convergently want to exploit those procedures, so as to achieve higher diamond-evaluations.
      How do you just “not argmax” or “not design agents which exploit adversarial inputs”?
      First of all, alignment researchers should stop trying to terminally motivate agents to optimize evaluations of their plans or outcomes. That’s doomed and doesn’t make sense.
      Second, A shot at the diamond alignment problem describes an agent which isn’t trying to exploit some diamond-grader. I didn’t do anything in particular in order to avoid training an agent which exploits adversarial inputs to a diamond-grader function. I think that you just don’t get that problem at all, unless you’re assuming cognition must decompose via the (IMO) strange frame of “outer/inner alignment.”
      (Humans get scammed and invent or fall into cults and crazy ideologies not infrequently, despite doing what you’re describing here already.)
      Note the presence of adversarial optimizers in most of these situations. The adversarial optimization comes from other people who are optimizing ideas to get spurious buy-in from victims.
      I expect that smart agents convergently wish to minimize the optimizer’s curse, because that leads to more of what they want.
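One way to make the optimizer’s curse concrete is a minimal simulation under stated assumptions: every plan has the same true value (0), and the agent only sees each plan’s value plus independent Gaussian noise. Picking the highest estimate then overstates the chosen plan’s value, and the overstatement grows with the number of plans searched. The function name `selection_gap` and all parameters are hypothetical, chosen for illustration.

```python
import random
import statistics


def selection_gap(num_plans: int, trials: int = 200, noise_sd: float = 1.0, seed: int = 0) -> float:
    """Average (estimate - true value) for the plan chosen by argmax over noisy estimates.

    Illustrative assumptions: every plan has true value 0.0, and each estimate
    is the true value plus independent Gaussian noise with standard deviation noise_sd.
    """
    rng = random.Random(seed)  # seeded so repeated calls give the same result
    true_value = 0.0
    gaps = []
    for _ in range(trials):
        estimates = [true_value + rng.gauss(0.0, noise_sd) for _ in range(num_plans)]
        best = max(range(num_plans), key=estimates.__getitem__)  # argmax over estimates
        gaps.append(estimates[best] - true_value)
    return statistics.fmean(gaps)


for n in (1, 3, 1_000):
    print(f"plans searched: {n:>5}  average overestimate of chosen plan: {selection_gap(n):.2f}")
```

With one plan there is no selection and the average overestimate is close to zero. With three plans it is already positive, and with a thousand it is several noise standard deviations, even though no plan here is actually better than any other.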
      - Wei Dai 24 Nov 2022 2:59 UTC
        LW: 5 AF: 5
        3
        Thanks for this longer reply and the link to your diamond alignment post, which help me understand your thinking better. I’m sympathetic to a lot of what you say, but feel like you tend to state your conclusions more strongly than the underlying arguments warrant.
        
        The adversarial optimization comes from other people who are optimizing ideas to get spurious buy-in from victims.
        
        I think a lot of crazy religions/ideologies/philosophies come from people genuinely trying to answer hard questions for themselves, but there are also some that are deliberate attempts to optimize against others (Scientology?).
        
        Daniel Kokotajlo described an analogy with EA, which you were going to answer but still haven’t. I would add that EA funders have to consider even more (and potentially more explicitly adversarial) proposals/plans than a typical EA, and AFAIK nobody has suggested that that’s doomed from the start because it amounts to argmaxing against an evaluator. Instead, everyone implicitly or explicitly recognizes the danger of adversarial plans, and tries to harden the evaluation process against them.
        - TurnTrout 26 Nov 2022 4:02 UTC
        LW: 2 AF: 2
        0
        I agree that humans sometimes fall prey to adversarial inputs, and am updating up on dangerous-thought density based on your religion argument. Any links to where I can read more?
        However, this does not seem important for my (intended) original point. Namely, if you’re trying to align e.g. a brute-force-search plan maximizer or a grader-optimizer, you will fail due to high-strength optimizer’s curse forcing you to evaluate extremely scary adversarial inputs. But also this is sideways of real-world alignment, where realistic motivations may not be best specified in the form of “utility function over observation/universe histories.”
        there are also some that are deliberate attempts to optimize against others (Scientology?).
        (Also, major religions are presumably memetically optimized. No deliberate choice required, on my model.)
        Daniel Kokotajlo described an analogy with EA, which you were going to answer but still haven’t.
        Answered now.
        I would add that EA funders have to consider even more (and potentially more explicitly adversarial) proposals/plans than a typical EA, and AFAIK nobody has suggested that that’s doomed from the start because it amounts to argmaxing against an evaluator. Instead, everyone implicitly or explicitly recognizes the danger of adversarial plans, and tries to harden the evaluation process against them.
        This seems disanalogous to the situation discussed in the OP. If we were designing, from scratch, a system which we wanted to pursue effective altruism, we would be extremely well-advised to not include grader-optimizers which are optimizing EA funder evaluations. Especially if the grader-optimizers will eventually get smart enough to write out the funders’ pseudocode. At best, that wastes computation. At (probable) worst, the system blows up.
        By contrast, we live in a world full of other people, some of whom are optimizing for status and power. Given that world, we should indeed harden our evaluation procedures, insofar as that helps us more faithfully evaluate grants and thereby achieve our goals.
        - Wei Dai 26 Nov 2022 7:34 UTC
        LW: 2 AF: 2
        0
        
        I agree that humans sometimes fall prey to adversarial inputs, and am updating up on dangerous-thought density based on your religion argument. Any links to where I can read more?
        
        Maybe https://en.wikipedia.org/wiki/Extraordinary_Popular_Delusions_and_the_Madness_of_Crowds (I don’t mean that you should read this book, which I haven’t read either, but you could use the wiki article to familiarize yourself with the historical episodes that the book talks about). See also https://en.wikipedia.org/wiki/Heaven's_Gate_(religious_group)
        
        However, this does not seem important for my (intended) original point. Namely, if you’re trying to align e.g. a brute-force-search plan maximizer or a grader-optimizer, you will fail due to high-strength optimizer’s curse forcing you to evaluate extremely scary adversarial inputs. But also this is sideways of real-world alignment, where realistic motivations may not be best specified in the form of “utility function over observation/universe histories.”
        
        My counterpoint here is that we have an example of human-aligned shard-based agents (namely humans) who are nevertheless unsafe, in part because they fall prey to dangerous thoughts. They generate these thoughts themselves, because they inevitably have to do some amount of search/optimization (of their thoughts/plans) as they try to reach their goals. And dangerous-thought density is high enough that even that limited amount of search/optimization frequently (on a societal level) hits upon dangerous thoughts.
        
        Wouldn’t a shard-based aligned AI have to do as much search/optimization as a human society collectively does, in order to be as competent/intelligent, in which case wouldn’t it be as likely to be unsafe in this regard? And what if it has an even higher density of dangerous thoughts, especially “out of distribution”, and/or does more search/optimization to try to find better-than-human thoughts/plans?
        
        (My own proposal here is to try to solve metaphilosophy or understand “correct reasoning” so that we / our AIs are able to safely think any thought or evaluate any plan, or at least have some kind of systematic understanding of what thoughts/plans are dangerous to think about. Or work on some more indirect way of eventually achieving something like this.)