> I think “Optimizing for the output of a grader which evaluates plans” is more-or-less how human brains choose plans
> I don’t think it’s feasible to make an AGI that doesn’t do that.

See my comment to Wei Dai. Argmax’s violation of the adversarial principle strongly suggests the existence of a better and more natural frame on the problem.
I deeply disagree. I think you might be conflating the quotation and the referent, two different patterns:
1. Local semi-reflective search: “does it make sense to spend another hour thinking of alternatives? [self-model says ‘yes’] OK, I will”; “do I predict ‘search for plans which would most persuade me of their virtue’ to actually lead to virtuous plans? [Self-model says ‘no’] Moving on...” This is what I think people do.
2. Global search against the output of an evaluation function implemented within the brain (grader-optimization: “what kinds of plans would my brain like the most?”).
It is possible to say sentences like “local semi-reflective search just is global search but with implicit constraints like ‘select for plans which your self-model likes’.” I don’t think this is true. I am going to posit that, as a matter of falsifiable physical fact, the human brain does not compute a predicate which, when checked against all possible plans, rules out all adversarial plans, such that you can just argmax over everything and get out what the person would have chosen/would have wanted to choose on reflection. If you argmax over human value shards relative to the plans they might grade, you’ll probably get some garbage plan where you’re, like, twitching on the floor.
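A minimal toy sketch of the structural difference between the two patterns, in Python. Everything in it is an illustrative assumption rather than anything from the posts under discussion: the string-counting `grader`, the example plans, and the stand-in predicates `worth_more_search` and `trust_search_strategy` are all made up. The point is only where the optimization pressure lands: the argmax pattern selects whatever plan the grader scores highest, including plans that merely exploit the grader, while the local pattern spends its reflection on whether and how to keep searching.

```python
def grader(plan: str) -> float:
    """A deliberately gameable stand-in evaluator: it just rewards mentions of
    'diamond', so plans that spam the word outscore plans that do the thing."""
    return plan.count("diamond")


def grader_optimization(all_plans: list[str]) -> str:
    """Pattern 2: global search against the grader's output ('what kinds of
    plans would my brain like the most?'). The winner is whatever exploits the
    grader hardest, not necessarily a plan anyone would endorse on reflection."""
    return max(all_plans, key=grader)


def worth_more_search(plan: str) -> bool:
    # Stand-in for "does it make sense to spend another hour? [self-model says 'yes']".
    return len(plan) < 120


def trust_search_strategy(tweak: str) -> bool:
    # Stand-in for rejecting "search for plans which would most persuade me of their virtue".
    return "persuade myself" not in tweak


def local_semi_reflective_search(plan: str, tweaks: list[str], budget: int) -> str:
    """Pattern 1: iterate locally on a plan, consulting a self-model about the
    search process itself rather than argmaxing an internal score over every
    representable plan."""
    for step, tweak in enumerate(tweaks):
        if step >= budget or not worth_more_search(plan):
            break
        if trust_search_strategy(tweak):
            plan = plan + "; " + tweak
    return plan


if __name__ == "__main__":
    plans = [
        "synthesize one real diamond in the lab",
        "diamond diamond diamond diamond",  # an adversarial input to this grader
    ]
    print(grader_optimization(plans))  # picks the grader-exploiting plan
    print(local_semi_reflective_search(
        "synthesize one real diamond in the lab",
        ["double-check supplier", "persuade myself the plan is great"],
        budget=5,
    ))  # keeps the endorsed tweak, drops the self-persuasion strategy
```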
You’ll notice that A shot at the diamond alignment problem makes no claim of the AGI having an internal-argmax-hardened diamond value shard.