So the commitment I want to make is just my current self yelling at my future self: “no, you should still bail us out even if ‘you’ no longer have skin in the game”. I expect that I would probably keep my word and honor a commitment like that, even if trading away 10 planets for 1 no longer seems like such a good idea.
This doesn’t make much sense to me. Why would your future self “honor a commitment like that”, if the “commitment” is essentially just one agent yelling at another agent to do something the second agent doesn’t want to do? I don’t understand what moral (or physical or motivational) force your “commitment” is supposed to have on your future self, if your future self does not already think doing the simulation trade is a good idea.
I mean, imagine if as a kid you had made a “commitment”, in the form of yelling at your future self, that if you ever had lots of money you’d spend it all on comic books and action figures. Now, as an adult, you’d just ignore it, right?
I have known a non-zero number of adults who made such commitments to themselves. (But I agree that is not the typical outcome, and I wouldn’t believe most people if they told me they would follow through.)
I strongly agree with this, but I’m confused that this is your view given that you endorse UDT. Why do you think your future self will honor the commitment of following UDT, even in situations where your future self wouldn’t want to honor it (because following UDT is not ex interim optimal from his perspective)?
I actually no longer fully endorse UDT. It still seems better than any other specific decision theory I know of, but it has a bunch of open problems, and I’m not very confident that someone won’t eventually find a better approach that replaces it.
To your question: I think if my future self decides to follow (something like) UDT, it won’t be because I made a “commitment” to do it, but because my future self wants to follow it, thinking it’s the right thing to do according to his best understanding of philosophy and normativity. I’m unsure about this, and the specific objection you have is probably covered under #1 in my list of open questions in the link above.
(And then there’s a very different scenario in which UDT gets used in the future, which is that it gets built into AIs, and then they keep using UDT until they decide not to, which if UDT is reflectively consistent would be never. I dis-endorse this even more strongly.)
To be clear, by “indexical values” in that context I assume you mean indexing on whether a given world is “real” vs “counterfactual,” not just indexical in the sense of being egoistic? (Because I think there are compelling reasons to reject UDT without being egoistic.)
I think being indexical in this sense (while being altruistic) can also lead you to reject UDT, but it doesn’t seem “compelling” that one should be altruistic this way. Want to expand on that?
(I might not reply further because of how historically I’ve found people seem to simply have different bedrock intuitions about this, but who knows!)
I intrinsically only care about the real world (I find the Tegmark IV arguments against this pretty unconvincing). As far as I can tell, the standard justification for acting as if one cares about nonexistent worlds is diachronic norms of rationality. But I don’t see an independent motivation for diachronic norms, as I explain here. Given this, I think it would be a mistake to pretend my preferences are something other than what they actually are.
If you only care about the real world and you’re sure there’s only one real world, then the fact that you at time 0 would sometimes want to bind yourself at time 1 (e.g., physically commit to some action, or self-modify to perform some action at time 1) seems very puzzling, or indicates that something must be wrong. At time 1 you’re in a strictly better epistemic position, having found out more information about which world is real, so what sense does it make for your decision theory to let you-at-time-0 override you-at-time-1’s decision?
(If you believed in something like Tegmark IV but your values constantly change to only care about the subset of worlds that you’re in, then time inconsistency, and wanting to override your later selves, would make more sense, as your earlier self and later self would simply have different values. But it seems counterintuitive to be altruistic this way.)
at time 1 you’re in a strictly better epistemic position
Right, but 1-me has different incentives by virtue of this epistemic position. Conditional on being at the ATM, 1-me would be better off not paying the driver. (Yet 0-me is better off if the driver predicts that 1-me will pay, hence the incentive to commit.)
I’m not sure if this is an instance of what you call “having different values” — if so I’d call that a confusing use of the phrase, and it doesn’t seem counterintuitive to me at all.
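The incentive structure here is essentially Parfit’s hitchhiker, and the flip between 0-me’s and 1-me’s incentives can be sketched in a few lines. The specific payoff numbers and the perfect-predictor assumption below are illustrative, not from the thread:

```python
# Toy model of the driver/ATM case: the driver rescues you only if she
# (accurately) predicts that 1-me will pay at the ATM afterwards.
VALUE_OF_RESCUE = 1_000_000
PAYMENT = 100

def payoff(pays_at_atm: bool) -> int:
    """0-me's payoff, assuming the driver's prediction of 1-me is correct."""
    rescued = pays_at_atm  # prediction matches 1-me's actual disposition
    if not rescued:
        return 0
    return VALUE_OF_RESCUE - PAYMENT

# Ex ante, 0-me prefers to be the kind of agent who pays: 999_900 > 0.
assert payoff(True) > payoff(False)

# Ex post, conditional on already being rescued and standing at the ATM,
# 1-me's incentive flips: refusing to pay would save PAYMENT.
assert VALUE_OF_RESCUE > VALUE_OF_RESCUE - PAYMENT
```

The point is just that both facts hold at once: committing is ex-ante optimal while defecting is ex-post optimal, without any change in what 1-me terminally values.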
I agree you can’t make actually binding commitments. But I think the kid-adult example is actually a good illustration of what I want to do: if a kid makes a solemn commitment to spend a one-in-a-hundred-million fraction of his money on action figures when he becomes a rich adult, I think that would usually work. And that’s what we are asking of our future selves.
Why? Perhaps we’d do it out of moral uncertainty, thinking maybe we owe something to our former selves, but future people probably won’t think this.
Currently our utility is roughly logarithmic in money, partly because we spend money on instrumental goals and there are diminishing returns as the limited opportunities get used up. This won’t be true of future utilitarians spending resources on their terminal values. So a “one in a hundred million” fraction of resources is a much bigger deal to them than to us.
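The log-versus-linear contrast can be made concrete with a toy calculation. The wealth levels and the assumption of exactly logarithmic versus exactly linear utility are my illustrative choices, not claims from the thread:

```python
import math

def log_utility_loss(wealth, fraction):
    """Utility lost by giving up `fraction` of wealth, if utility = log(wealth)."""
    return math.log(wealth) - math.log(wealth * (1 - fraction))

f = 1e-8  # a "one in a hundred million" fraction

# Under log utility the loss is -log(1 - f), roughly f for small f,
# and (crucially) independent of how much wealth you start with:
loss_modest = log_utility_loss(1e4, f)   # ~1e-8 utils
loss_vast = log_utility_loss(1e12, f)    # also ~1e-8 utils

# Under linear utility (resources valued terminally, no diminishing
# returns), the loss scales with the total stake, so the same fraction
# of an astronomical endowment is an astronomically larger sacrifice:
linear_loss_vast = 1e12 * f              # ~1e4 units of terminal value
```

So for an agent with diminishing returns, the pledge costs roughly the same tiny amount no matter how rich it gets; for a future utilitarian with linear terminal values, the cost of the same fraction grows in proportion to the resources at stake.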
thinking maybe we owe something to our former selves, but future people probably won’t think this
This is a very strong assertion. Aren’t most people on this forum, when making present claims about what they would like to happen in the future, trying to form this contract? (This comes back to the value lock-in debate.)