“Consider the expected consequences of the plan “think a lot longer and harder, considering a lot more possibilities for what you should do, and then make your decision.” I currently predict that such a plan would lead future-me to waste his life doing philosophy or maybe get pascal’s mugged by some longtermist AI bullshit instead of actually helping people with his donations. My helping-people shard doesn’t like this plan, because it predicts abstractly that thinking a lot more will not result in helping people more.”
(Basically I’m saying you should think more, and then write more, about the difference between these two cases, because they seem plausibly on a spectrum to me, and that should make us nervous in a couple of ways. Are we actually being really stupid by being EAs and shutting up and calculating? Have we basically adversarial-exampled ourselves away from the things we actually thought were altruistic and effective back in the day? If not, what’s different between the kind of extended search process we did and its logical extension: an even more extended search process, one so extreme that outsiders would call the result an adversarial example?)
I think there are several things happening. Here are some:
If an EA-to-be (“EAlice”, let’s say) in fact thought that EA would make her waste her life on bullshit, but went ahead anyway, then she subjectively made a mistake.
Were her expectations correct? That’s another question. I personally think that AI ruin is real; it’s not low-probability Pascal’s Mugging BS, it’s the default outcome IMO.
I think many EAs are making distortionary value choices.
There is a socially easy way to quash parts of yourself which don’t have immediate sophisticated-sounding arguments backing them up.
But whatever values you do have (e.g. caring about your family), whatever caring you originally developed (via RL, according to shard theory), didn’t come from some grand consequentialist or game-theoretic argument about happiness or freedom.
So why should other values, like “avoiding spiders” or “taking time to relax”, have to justify themselves? They’re part of my utility function, so to speak! That’s not up for grabs!
I care more about my mom than other peoples’ moms. Sue me!
I agree with much of Self-Integrity and the Drowning Child.
I think a bunch of this has to do with meta-ethics, not with adversarial examples to values.
It might be that your original “helping people” values are not what your old value-distribution would have reflectively endorsed. Like, maybe you were just prioritizing your friends and neighbors, but if you’d ever really thought about it, your reflective, strong, broadly activated shard coalition would have ruled “hey, let’s care about faraway people more.”
EG cooperation + happiness + empathy + fairness + local-helping shard → generalize by creating a global-helping shard
Or maybe EA did in fact trick EAlice and socially pressure and reshape her into a new being for whom this is reflectively endorsed.
EG social shard → global-helping shard
Although EA is in fact selecting for people against whom its arguments constitute (weak) adversarial inputs, I don’t think the selection is that strong? Confused here.
EDIT: One of the main threads is Don’t design agents which exploit adversarial inputs. The point isn’t that people can’t or don’t fall victim to plans which, by virtue of spurious appeal to a person’s value shards, cause the person to unwisely pursue the plan. The point here is that (I claim) intelligent people convergently want to avoid this happening to them.
A diamond-shard will not try to find adversarial inputs to itself. That was my original point, and I think it stands.
I think I agree with everything you said yet still feel confused. My question/objection/issue was not so much “How do you explain people sometimes falling victim to plans which spuriously appeal to their value shards!?!? Checkmate!” but rather “what does it mean for an appeal to be spurious? What is the difference between just thinking long and hard about what to do vs. adversarially selecting a plan that’ll appeal to you? Isn’t the former going to in effect basically equal the latter, thanks to extremal Goodhart? In the limit where you consider all possible plans (maximum optimization power), aren’t they the same?”
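Here’s a minimal sketch of how I’m picturing the worry, just my own toy framing rather than anything from your posts: treat “how appealing a plan looks to my shards” as a noisy proxy for how good the plan actually is, and compare a small search over candidate plans with a huge one.

```python
# Toy model (my framing, not from the posts): each plan has a true value
# ("does it actually help?") and the agent only sees a noisy proxy of it
# ("how appealing does it look?"). Picking the most appealing plan out of
# ever-larger candidate pools increasingly selects for plans whose appeal
# comes from proxy error rather than from genuine value.
import random

random.seed(0)

def sample_plan():
    true_value = random.gauss(0, 1)    # how much the plan actually helps
    proxy_error = random.gauss(0, 1)   # spurious appeal / misjudgment
    appeal = true_value + proxy_error  # what deliberation "sees"
    return true_value, appeal

for n_plans in (10, 1_000, 100_000):
    candidates = [sample_plan() for _ in range(n_plans)]
    best_true, best_appeal = max(candidates, key=lambda p: p[1])
    print(f"searched {n_plans:>7} plans: looks like {best_appeal:+.2f}, "
          f"actually worth {best_true:+.2f}")
```

In that toy setup the gap between “looks appealing” and “actually helps” keeps widening as the search gets bigger, which is exactly why I’m unsure the two cases really come apart.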
Yes, that’s a good question. This is what I’ve been aiming to answer with recent posts.
(I’m presently confident the answer is “no”, as might be clear from my comments and posts!)
OK, guess I’ll go read those posts then...