Wei Dai comments on Don’t design agents which exploit adversarial inputs

Wei Dai 18 Nov 2022 5:15 UTC
LW: 8 AF: 6
2
AF
(I hesitated to post these comments in case they’re not relevant to the main point you’re trying to make or will be addressed in the next post. Feel free to ignore if that’s the case.)

Value-child: The mother makes her kid care about working hard and behaving well.

How does one do this? (Not entirely rhetorical.)

Amplified humans spend 5,000 years thinking about how many diamonds the plan produces in the next 100 years, and write down their conclusions as the expected utility of the plan.

Due to the exponentially large plan space and the fact that humans are not cognitively secure systems, there exists a long sequence of action commands which cognitively impairs all of the humans and makes them prematurely stop the search and return a huge number.

If I was doing the evaluation, I wouldn’t look at the plan directly but spend the first 4999 years slowly and carefully upgrading myself and my AI helpers, and then if I’m still not sure I can safely evaluate a plan, I would just throw an exception or return an error code instead of looking at the plan.

This lets us abstract away e.g. seemingly annoying complications with reflective agents which think about their future planning process. This seemingly[4] relaxes the problem.

Another reason to think about argmax in relation to AI safety/alignment is if you design an AI that doesn’t argmax (or do its best to approximate argmax), and someone else builds one that does, your AI will lose a fair fight (e.g., economic competition starting from equal capabilities and resource endowments), so it would be nice if alignment doesn’t mean giving up argmax.
- TurnTrout 21 Nov 2022 20:16 UTC
  LW: 10 AF: 6
  8
  AF Parent
  if you design an AI that doesn’t argmax (or do its best to approximate argmax), and someone else builds one that does, your AI will lose a fair fight (e.g., economic competition starting from equal capabilities and resource endowments), so it would be nice if alignment doesn’t mean giving up argmax.
  This seems exactly backwards to me. Argmax violates the non-adversarial principle and wastes computation. Argmax requires you to spend effort hardening your own utility function against the effort you’re also expending searching across all possible inputs to your utility function (including the adversarial inputs!). For example, if I argmaxed over my own plan-evaluations, I’d have to consider the most terrifying-to-me basilisks possible, and rate none of them unusually highly. I’d have to spend effort hardening my own ability to evaluate plans, in order to safely consider those possibilities.
  It would be far wiser to not consider all possible plans, and instead close off large parts of the search space. You can consider what plans to think about next, and how long to think, and so on. And then you aren’t argmaxing. You’re using resources effectively.
  For example, some infohazardous thoughts exist (like hyper-optimized-against-you basilisks) which are dangerous to think about (although most thoughts are probably safe). But an agent which plans its next increment of planning using a reflective self-model is IMO not going to be like “hey it would be predicted-great if I spent the next increment of time thinking about an entity which is trying to manipulate me.” So e.g. a reflective agent trying to actually win with the available resources, wouldn’t do something dumb like “run argmax” or “find the plan which some part of me evaluates most highly.”
  (See Charles Foster’s comment for another perspective here.)
  If I was doing the evaluation, I wouldn’t look at the plan directly but spend the first 4999 years slowly and carefully upgrading myself and my AI helpers, and then if I’m still not sure I can safely evaluate a plan, I would just throw an exception or return an error code instead of looking at the plan.
  Unless this grader procedure implements a perfectly robust mathematical (plan input)->(grade output) function, you get hacked.
  What links here?
  - Don’t align agents to evaluations of plans by TurnTrout (26 Nov 2022 21:16 UTC; 45 points)
  - TurnTrout's comment on Don’t design agents which exploit adversarial inputs by TurnTrout (21 Nov 2022 20:57 UTC; 3 points)
  - Wei Dai 22 Nov 2022 1:42 UTC
    LW: 3 AF: 3
    0
    AF Parent
    
    It would be far wiser to not consider all possible plans, and instead close off large parts of the search space. You can consider what plans to think about next, and how long to think, and so on. And then you aren’t argmaxing. You’re using resources effectively.
    
    But aren’t you still argmaxing within the space of plans that you haven’t closed off (or are actively considering), and still taking a risk of finding some adversarial plan within that space? (Humans get scammed and invent or fall into cults and crazy ideologies not infrequently, despite doing what you’re describing here already.) How do you just “not argmax” or “not design agents which exploit adversarial inputs”?
    
    Maybe there’s no substantive disagreement here, merely an issue of presentation/communication? I.e., when you say “you aren’t argmaxing” perhaps you don’t mean “don’t ever use argmax anywhere” but instead “don’t argmax over the whole plan space” and by “don’t design agents which exploit adversarial inputs” you mean something like “we should try to find ways to avoid or reduce the risk adversarial inputs”?
    - TurnTrout 24 Nov 2022 1:06 UTC
      LW: 2 AF: 2
      0
      AF Parent
      I.e., when you say “you aren’t argmaxing” perhaps you don’t mean “don’t ever use argmax anywhere” but instead “don’t argmax over the whole plan space”
      I was primarily critiquing “argmax over the whole plan space.” I do caution that I think it’s extremely important to not round off “iterative, reflective planning and reasoning” as “restricted argmax”, because that obscures the dynamics and results of real-world cognition. Argmax is also a bad model of what people are doing when they think, and how I expect realistic embedded agents to think.
      “don’t design agents which exploit adversarial inputs” you mean something like “we should try to find ways to avoid or reduce the risk adversarial inputs”
      No, I mean: don’t design agents which are motivated to find and exploit adversarial inputs. Don’t align an agent to evaluations which are only nominally about diamonds, and then expect the agent to care about diamonds! You wouldn’t align an agent to care about cows and then be surprised that it didn’t care about diamonds. Why be surprised here?
      - TurnTrout 24 Nov 2022 1:06 UTC
        LW: 5 AF: 3
        2
        AF Parent
        I wrote a bunch more before realizing that we maybe don’t disagree fully on the “don’t argmax” point. Here:
        But aren’t you still argmaxing within the space of plans that you haven’t closed off (or are actively considering),
        Not really? I think it is inappropriately suggestive to describe this as “argmaxing.” I, for one, usually feel like I consider at most three plans during most planning sessions. Most of the work is going to be in my generative models, in my learned habits of thought, in my snap reflective assessments of what I should think about next.
        How many different plans do you consider for going to the store? For writing a LessWrong post? Even if you did consider more plans, you’d convergently want to explore parts of the plan-space which you think won’t contain secret adversarial examples to your own evaluations. EG at first pass, just don’t think about entities trying to acausally blackmail you.
        Argmax is an abstraction which may or may not actually describe a given cognitive process. I think that if we label reflective incremental planning and reasoning as “argmax”, we’re missing a serious opportunity for original thought, for considering in detail what the algorithm does.
        and still taking a risk of finding some adversarial plan within that space?
        There is indeed a risk you’ll find an adversarial plan. But what is the risk, quantitatively? A reflective agent will convergently wish to avoid thinking about plans which exploit its own evaluation procedures and reasoning (eg tricking the diamond-shard into bidding for plans). In stark contrast, grader-optimizers and argmaxers convergently want to exploit those procedures, so as to achieve higher diamond-evaluations.
        How do you just “not argmax” or “not design agents which exploit adversarial inputs”?
        First of all, alignment researchers should stop trying to terminally motivate agents to optimize evaluations of their plans or outcomes. That’s doomed and doesn’t make sense.
        Second, A shot at the diamond alignment problem describes an agent which isn’t trying to exploit some diamond-grader. I didn’t do anything in particular in order to avoid training an agent which exploits adversarial inputs to a diamond-grader function. I think that you just don’t get that problem at all, unless you’re assuming cognition must decompose via the (IMO) strange frame of “outer/inner alignment.”
        (Humans get scammed and invent or fall into cults and crazy ideologies not infrequently, despite doing what you’re describing here already.)
        Note the presence of adversarial optimizers in most of these situations. The adversarial optimization comes from other people who are optimizing ideas to get spurious buy-in from victims.
        I expect that smart agents convergently wish to minimize the optimizer’s curse, because that leads to more of what they want.
        What links here?
        Don’t align agents to evaluations of plans by TurnTrout (26 Nov 2022 21:16 UTC; 45 points)
        Wei Dai 24 Nov 2022 2:59 UTC
        LW: 5 AF: 5
        3
        AF Parent
        Thanks for this longer reply and the link to your diamond alignment post, which help me understand your thinking better. I’m sympathetic to a lot of what you say, but feel like you tend to state your conclusions more strongly than the underlying arguments warrant.
        
        The adversarial optimization comes from other people who are optimizing ideas to get spurious buy-in from victims.
        
        I think a lot of crazy religions/ideologies/philosophies come from people genuinely trying to answer hard questions for themselves, but there are also some that are deliberate attempts to optimize against others (Scientology?).
        
        Daniel Kokotajlo described an analogy with EA, which you were going to answer but still haven’t. I would add that EA funders have to consider even more (and potentially more explicitly adversarial) proposals/plans than a typical EA, and AFAIK nobody has suggested that that’s doomed from the start because it amounts to argmaxing against an evaluator. Instead, everyone implicitly or explicitly recognizes the danger of adversarial plans, and tries to harden the evaluation process against them.
        TurnTrout 26 Nov 2022 4:02 UTC
        LW: 2 AF: 2
        0
        AF Parent
        I agree that humans sometimes fall prey to adversarial inputs, and am updating up on dangerous-thought density based on your religion argument. Any links to where I can read more?
        However, this does not seem important for my (intended) original point. Namely, if you’re trying to align e.g. a brute-force-search plan maximizer or a grader-optimizer, you will fail due to high-strength optimizer’s curse forcing you to evaluate extremely scary adversarial inputs. But also this is sideways of real-world alignment, where realistic motivations may not be best specified in the form of “utility function over observation/universe histories.”
        there are also some that are deliberate attempts to optimize against others (Scientology?).
        (Also, major religions are presumably memetically optimized. No deliberate choice required, on my model.)
        Daniel Kokotajlo described an analogy with EA, which you were going to answer but still haven’t.
        Answered now.
        I would add that EA funders have to consider even more (and potentially more explicitly adversarial) proposals/plans than a typical EA, and AFAIK nobody has suggested that that’s doomed from the start because it amounts to argmaxing against an evaluator. Instead, everyone implicitly or explicitly recognizes the danger of adversarial plans, and tries to harden the evaluation process against them.
        This seems disanalogous to the situation discussed in the OP. If we were designing, from scratch, a system which we wanted to pursue effective altruism, we would be extremely well-advised to not include grader-optimizers which are optimizing EA funder evaluations. Especially if the grader-optimizers will eventually get smart enough to write out the funders’ pseudocode. At best, that wastes computation. At (probable) worst, the system blows up.
        By contrast, we live in a world full of other people, some of whom are optimizing for status and power. Given that world, we should indeed harden our evaluation procedures, insofar as that helps us more faithfully evaluate grants and thereby achieve our goals.
        What links here?
        Don’t align agents to evaluations of plans by TurnTrout (26 Nov 2022 21:16 UTC; 45 points)
        Wei Dai 26 Nov 2022 7:34 UTC
        LW: 2 AF: 2
        0
        AF Parent
        
        I agree that humans sometimes fall prey to adversarial inputs, and am updating up on dangerous-thought density based on your religion argument. Any links to where I can read more?
        
        Maybe https://en.wikipedia.org/wiki/Extraordinary_Popular_Delusions_and_the_Madness_of_Crowds (I don’t mean read this book, which I haven’t either, but you could use the wiki article to familiarize yourself with the historical episodes that the book talks about.) See also https://en.wikipedia.org/wiki/Heaven’s_Gate_(religious_group)
        
        However, this does not seem important for my (intended) original point. Namely, if you’re trying to align e.g. a brute-force-search plan maximizer or a grader-optimizer, you will fail due to high-strength optimizer’s curse forcing you to evaluate extremely scary adversarial inputs. But also this is sideways of real-world alignment, where realistic motivations may not be best specified in the form of “utility function over observation/universe histories.”
        
        My counterpoint here is, we have an example of human-aligned shard-based agents (namely humans), who are nevertheless unsafe in part because they fall prey to dangerous thoughts, which they themselves generate because they inevitably have to do some amount of search/optimization (of their thoughts/plans) as they try to reach their goals, and dangerous-thought density is high enough that even that limited amount of search/optimization is enough to frequently (on a societal level) hit upon dangerous thoughts.
        
        Wouldn’t a shard-based aligned AI have to do as much search/optimization as a human society collectively does, in order to be as competent/intelligent, in which case wouldn’t it be as likely to be unsafe in this regard? And what if it has an even higher density of dangerous thoughts, especially “out of distribution”, and/or does more search/optimization to try to find better-than-human thoughts/plans?
        
        (My own proposal here is to try to solve metaphilosophy or understand “correct reasoning” so that we / our AIs are able to safely think any thought or evaluation any plan, or at least have some kind of systematic understanding of what thoughts/plans are dangerous to think about. Or work on some more indirect way of eventually achieving something like this.)
- jacob_cannell 18 Nov 2022 22:03 UTC
  5 points
  3
  Parent
  
  Another reason to think about argmax in relation to AI safety/alignment is if you design an AI that doesn’t argmax (or do its best to approximate argmax),
  
  Actual useful AGI will not be built from argmax, because it’s not really useful for efficient approximate planning. You have exponential (in time) uncertainty from computational approximation and fundamental physics. This results in uncertainty over future state value estimates, and if you try to argmax with that uncertainty you are just selecting for noise. The correct solutions for handling uncertainty lead to something more like softmax or soft actor critic which avoids these issues (and also naturally leads to empowerment as an emergent heuristic).
  
  So argmax is only useful in toy problem domains, mostly worthless for real world planning. To the extent much of standard alignment arguments now rests on this misunderstanding, those arguments are misfounded.
  - Wei Dai 19 Nov 2022 14:14 UTC
    3 points
    2
    Parent
    Which of the standard alignment arguments do you think no longer hold up if we replace argmax with softmax?
    
    The first one that comes to my mind is: suppose we live in a world where intelligence explosion is possible, and someone builds an AI with flawed utility function, it would quickly become superintelligent and ignore orders to shut down because shutting down has lower expected utility than not shutting down. It seems to me that replacing the argmax in the AI’s decision procedure with softmax results in the same outcome, since the AI’s estimated expected utility of not shutting down would be vastly greater than shutting down, resulting in a softmax of near 1 for that option.
    
    Am I misunderstanding something in the paragraph above, or do you have other arguments in mind?
    - jacob_cannell 19 Nov 2022 19:28 UTC
      4 points
      1
      Parent
      
      Which of the standard alignment arguments do you think no longer hold up if we replace argmax with softmax?
      
      The specific argument that you just referenced in your earlier comment: that argmax is important for competitiveness, but that argmax is inherently unsafe because of adversarial optimization (“argmax is a trap”).
      
      The first one that comes to my mind is: suppose we live in a world where intelligence explosion is possible, and someone builds an AI with flawed utility function,
      
      If you assume you’ve already completely failed then the how/why is less interesting.
      
      The argmax argument expounded further is that any slight imperfection in the utility function results in doom, because of adversarial optimization magnifying that slight imperfection as you extend the planning horizon into the far future and improve planning/modeling precision.
      
      But that isn’t actually how it works. Instead due to compounding planning uncertainty far future value distributions are high variance and you get convergence to empowerment as I mentioned in the linked discussion.
      
      But that’s good news because it means that small mis-specifications in the utility function model converge away rather than diverging to infinity. The planning trajectory just converges to empowerment, regardless of the utility function, so this is good news for alignment.
      - Wei Dai 19 Nov 2022 22:28 UTC
        4 points
        3
        Parent
        
        The specific argument that you just referenced in your earlier comment: that argmax is important for competitiveness, but that argmax is inherently unsafe because of adversarial optimization (“argmax is a trap”).
        
        Assuming softmax is important for competitiveness instead, I don’t see why this argument doesn’t go through with “argmax” replaced by “softmax” throughout (including the “argmax is a trap” section of the OP). I read your linked comment and post, and still don’t understand. I wonder what the authors of the OP (or anyone else) think about this.
- TurnTrout 26 Nov 2022 3:33 UTC
  LW: 3 AF: 3
  0
  AF Parent
  Value-child: The mother makes her kid care about working hard and behaving well.
  How does one do this? (Not entirely rhetorical.)
  See here for more on what value-child’s cognition might look like.
- TurnTrout 21 Nov 2022 20:05 UTC
  LW: 2 AF: 2
  0
  AF Parent
  Thanks for leaving the comments!
  Value-child: The mother makes her kid care about working hard and behaving well.
  How does one do this? (Not entirely rhetorical.)
  I don’t know how to do it perfectly, of course.^[1] But I infer that it can be done, because there exist people who in fact intrinsically care about working hard and behaving well. So why can’t the child also be made to make decisions in a similar manner? Take those values and transplant them into the child via some kind of “model surgery.” (Unrealistic, yes. But so was “inner-align the child onto the evaluations output by his model of his mom.”)
  All that the parable requires is that it can be done, that we are talking about a realistic and possible mind design pattern.
  I also wrote in a footnote:
  Value-child is not trying to find a plan which he would evaluate as good. He is finding plans which evaluate as good. I think this is the kind of motivation which real-world intelligences tend to have. (More on how value-child works in the next essay.)
  1. ^
    More concretely, I’m happy to make guesses like “judiciously supply M&Ms and praise to reward-shape them when they’re working hard and behaving well, and emphasize why they’re getting the rewards—they’re working hard and behaving well” and “show them cool media where the protagonist works hard and behaves well.”
- Gunnar_Zarncke 18 Nov 2022 14:00 UTC
  2 points
  0
  Parent
  > Value-child: The mother makes her kid care about working hard and behaving well.
  How does one do this? (Not entirely rhetorical.)
  I think this post is not trying to answer this but just pointing out the discrepancy. The next post will probably come back to this:
  In the next essay, I’ll point out how this is obstacle is an artifact of these design patterns, and not any intrinsic difficulty of alignment.