TurnTrout comments on Don’t design agents which exploit adversarial inputs

TurnTrout 24 Nov 2022 1:06 UTC
LW: 5 AF: 3
I wrote a bunch more before realizing that we maybe don’t disagree fully on the “don’t argmax” point. Here:
But aren’t you still argmaxing within the space of plans that you haven’t closed off (or are actively considering),
Not really? I think it is inappropriately suggestive to describe this as “argmaxing.” I, for one, usually feel like I consider at most three plans during most planning sessions. Most of the work is going to be in my generative models, in my learned habits of thought, in my snap reflective assessments of what I should think about next.
How many different plans do you consider for going to the store? For writing a LessWrong post? Even if you did consider more plans, you’d convergently want to explore parts of the plan-space which you think won’t contain secret adversarial examples to your own evaluations. E.g., at first pass, just don’t think about entities trying to acausally blackmail you.
Argmax is an abstraction which may or may not actually describe a given cognitive process. I think that if we label reflective incremental planning and reasoning as “argmax”, we’re missing a serious opportunity for original thought, for considering in detail what the algorithm does.
and still taking a risk of finding some adversarial plan within that space?
There is indeed a risk you’ll find an adversarial plan. But what is the risk, quantitatively? A reflective agent will convergently wish to avoid thinking about plans which exploit its own evaluation procedures and reasoning (e.g. tricking the diamond-shard into bidding for plans). In stark contrast, grader-optimizers and argmaxers convergently want to exploit those procedures, so as to achieve higher diamond-evaluations.
How do you just “not argmax” or “not design agents which exploit adversarial inputs”?
First of all, alignment researchers should stop trying to terminally motivate agents to optimize evaluations of their plans or outcomes. That’s doomed and doesn’t make sense.
Second, “A shot at the diamond alignment problem” describes an agent which isn’t trying to exploit some diamond-grader. I didn’t do anything in particular in order to avoid training an agent which exploits adversarial inputs to a diamond-grader function. I think that you just don’t get that problem at all, unless you’re assuming cognition must decompose via the (IMO) strange frame of “outer/inner alignment.”
(Humans get scammed and invent or fall into cults and crazy ideologies not infrequently, despite doing what you’re describing here already.)
Note the presence of adversarial optimizers in most of these situations. The adversarial optimization comes from other people who are optimizing ideas to get spurious buy-in from victims.
I expect that smart agents convergently wish to minimize the optimizer’s curse, because that leads to more of what they want.
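As a rough illustration of the optimizer's curse referred to above, here is a toy sketch that is not part of the original comment. It assumes each plan has a true value drawn from a standard normal distribution and that the evaluator sees that value plus independent Gaussian noise; the function name `optimizers_curse_gap` and its parameters are hypothetical. It only models unbiased noise, not deliberately adversarial inputs, so it probably understates the problem under discussion.

```python
import random
import statistics


def optimizers_curse_gap(num_plans, noise_sd=1.0, trials=2000, seed=0):
    """Average overestimate of the plan chosen by argmax over noisy evaluations.

    Toy model (an assumption for illustration): true plan values are N(0, 1)
    and the evaluator observes true value + N(0, noise_sd^2) noise.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        true_values = [rng.gauss(0.0, 1.0) for _ in range(num_plans)]
        estimates = [v + rng.gauss(0.0, noise_sd) for v in true_values]
        # Argmax over the evaluator's estimates, not over the true values.
        best = max(range(num_plans), key=estimates.__getitem__)
        # How much the evaluator overrates the plan it selected.
        gaps.append(estimates[best] - true_values[best])
    return statistics.mean(gaps)
```

Calling `optimizers_curse_gap(3)` and `optimizers_curse_gap(1000)` compares a planner that considers a handful of plans with one that searches much more widely. In this toy model the average overestimate grows with the number of plans searched, which is one way to see why the amount of search matters for the "what is the risk, quantitatively?" question.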
What links here?
- Don’t align agents to evaluations of plans by TurnTrout (26 Nov 2022 21:16 UTC; 45 points)
- Wei Dai 24 Nov 2022 2:59 UTC
  LW: 5 AF: 5
  Thanks for this longer reply and the link to your diamond alignment post, which help me understand your thinking better. I’m sympathetic to a lot of what you say, but feel like you tend to state your conclusions more strongly than the underlying arguments warrant.
  
  The adversarial optimization comes from other people who are optimizing ideas to get spurious buy-in from victims.
  
  I think a lot of crazy religions/ideologies/philosophies come from people genuinely trying to answer hard questions for themselves, but there are also some that are deliberate attempts to optimize against others (Scientology?).
  
  Daniel Kokotajlo described an analogy with EA, which you were going to answer but still haven’t. I would add that EA funders have to consider even more (and potentially more explicitly adversarial) proposals/plans than a typical EA, and AFAIK nobody has suggested that that’s doomed from the start because it amounts to argmaxing against an evaluator. Instead, everyone implicitly or explicitly recognizes the danger of adversarial plans, and tries to harden the evaluation process against them.
  - TurnTrout 26 Nov 2022 4:02 UTC
    LW: 2 AF: 2
    I agree that humans sometimes fall prey to adversarial inputs, and am updating up on dangerous-thought density based on your religion argument. Any links to where I can read more?
    However, this does not seem important for my (intended) original point. Namely, if you’re trying to align e.g. a brute-force-search plan maximizer or a grader-optimizer, you will fail due to high-strength optimizer’s curse forcing you to evaluate extremely scary adversarial inputs. But also this is sideways of real-world alignment, where realistic motivations may not be best specified in the form of “utility function over observation/universe histories.”
    there are also some that are deliberate attempts to optimize against others (Scientology?).
    (Also, major religions are presumably memetically optimized. No deliberate choice required, on my model.)
    Daniel Kokotajlo described an analogy with EA, which you were going to answer but still haven’t.
    Answered now.
    I would add that EA funders have to consider even more (and potentially more explicitly adversarial) proposals/plans than a typical EA, and AFAIK nobody has suggested that that’s doomed from the start because it amounts to argmaxing against an evaluator. Instead, everyone implicitly or explicitly recognizes the danger of adversarial plans, and tries to harden the evaluation process against them.
    This seems disanalogous to the situation discussed in the OP. If we were designing, from scratch, a system which we wanted to pursue effective altruism, we would be extremely well-advised to not include grader-optimizers which are optimizing EA funder evaluations. Especially if the grader-optimizers will eventually get smart enough to write out the funders’ pseudocode. At best, that wastes computation. At (probable) worst, the system blows up.
    By contrast, we live in a world full of other people, some of whom are optimizing for status and power. Given that world, we should indeed harden our evaluation procedures, insofar as that helps us more faithfully evaluate grants and thereby achieve our goals.
    - Wei Dai 26 Nov 2022 7:34 UTC
      LW: 2 AF: 2
      
      I agree that humans sometimes fall prey to adversarial inputs, and am updating up on dangerous-thought density based on your religion argument. Any links to where I can read more?
      
      Maybe https://en.wikipedia.org/wiki/Extraordinary_Popular_Delusions_and_the_Madness_of_Crowds (I don’t mean read this book, which I haven’t either, but you could use the wiki article to familiarize yourself with the historical episodes that the book talks about.) See also https://en.wikipedia.org/wiki/Heaven's_Gate_(religious_group)
      
      However, this does not seem important for my (intended) original point. Namely, if you’re trying to align e.g. a brute-force-search plan maximizer or a grader-optimizer, you will fail due to high-strength optimizer’s curse forcing you to evaluate extremely scary adversarial inputs. But also this is sideways of real-world alignment, where realistic motivations may not be best specified in the form of “utility function over observation/universe histories.”
      
      My counterpoint here is that we have an example of human-aligned shard-based agents (namely humans) who are nevertheless unsafe, in part because they fall prey to dangerous thoughts. They generate these thoughts themselves, because they inevitably have to do some amount of search/optimization (of their thoughts/plans) as they try to reach their goals, and dangerous-thought density is high enough that even that limited amount of search/optimization is enough to frequently (on a societal level) hit upon dangerous thoughts.
      
      Wouldn’t a shard-based aligned AI have to do as much search/optimization as a human society collectively does, in order to be as competent/intelligent, in which case wouldn’t it be as likely to be unsafe in this regard? And what if it has an even higher density of dangerous thoughts, especially “out of distribution”, and/or does more search/optimization to try to find better-than-human thoughts/plans?
      
      (My own proposal here is to try to solve metaphilosophy or understand “correct reasoning” so that we / our AIs are able to safely think any thought or evaluate any plan, or at least have some kind of systematic understanding of what thoughts/plans are dangerous to think about. Or work on some more indirect way of eventually achieving something like this.)