I actually no longer fully endorse UDT. It still seems like a better approach to decision theory than any other specific approach I know of, but it has a bunch of open problems, and I'm not very confident that someone won't eventually find a better approach that replaces it.
To your question, I think if my future self decides to follow (something like) UDT, it won't be because I made a "commitment" to do it, but because my future self will want to follow it, judging it the right thing to do according to his best understanding of philosophy and normativity. I'm unsure about this, and the specific objection you have is probably covered under #1 in my list of open questions in the link above.
(And then there's a very different scenario in which UDT gets used in the future, which is that it gets built into AIs, and then they keep using UDT until they decide not to, which, if UDT is reflectively consistent, would be never. I dis-endorse this even more strongly.)
To be clear, by “indexical values” in that context I assume you mean indexing on whether a given world is “real” vs “counterfactual,” not just indexical in the sense of being egoistic? (Because I think there are compelling reasons to reject UDT without being egoistic.)
I think being indexical in this sense (while being altruistic) can also lead you to reject UDT, but it doesn’t seem “compelling” that one should be altruistic this way. Want to expand on that?
(I might not reply further because of how historically I’ve found people seem to simply have different bedrock intuitions about this, but who knows!)
I intrinsically only care about the real world (I find the Tegmark IV arguments against this pretty unconvincing). As far as I can tell, the standard justification for acting as if one cares about nonexistent worlds is diachronic norms of rationality. But I don’t see an independent motivation for diachronic norms, as I explain here. Given this, I think it would be a mistake to pretend my preferences are something other than what they actually are.
If you only care about the real world and you're sure there's only one real world, then the fact that you at time 0 would sometimes want to bind yourself at time 1 (e.g., physically commit to some action or self-modify to perform some action at time 1) seems very puzzling, or indicates that something must be wrong. At time 1 you're in a strictly better epistemic position, having found out more information about which world is real, so what sense does it make for your decision theory to have you-at-time-0 override you-at-time-1's decision?
(If you believed in something like Tegmark IV but your values constantly change to only care about the subset of worlds that you’re in, then time inconsistency, and wanting to override your later selves, would make more sense, as your earlier self and later self would simply have different values. But it seems counterintuitive to be altruistic this way.)
at time 1 you’re in a strictly better epistemic position
Right, but 1-me has different incentives by virtue of this epistemic position. Conditional on being at the ATM, 1-me would be better off not paying the driver. (Yet 0-me is better off if the driver predicts that 1-me will pay, hence the incentive to commit.)
I’m not sure if this is an instance of what you call “having different values” — if so I’d call that a confusing use of the phrase, and it doesn’t seem counterintuitive to me at all.
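The ATM example above has the standard Parfit's-hitchhiker payoff structure, and the divergence between 0-me and 1-me can be made concrete with numbers. Here is a minimal sketch; the specific utilities (+1000 for being rescued, −100 for paying) and the perfect-predictor assumption are illustrative assumptions, not taken from the thread:

```python
# Parfit's hitchhiker with illustrative (assumed) payoffs.
# At time 0 you are stranded; the driver rescues you only if they
# predict you will pay $100 at the ATM at time 1.
RESCUE_UTILITY = 1000
PAYMENT_COST = 100

def outcome(predicted_to_pay: bool, actually_pays: bool) -> int:
    """Agent's utility. With a perfect predictor, predicted_to_pay
    equals actually_pays in any consistent scenario."""
    rescued = predicted_to_pay
    utility = RESCUE_UTILITY if rescued else 0
    if rescued and actually_pays:
        utility -= PAYMENT_COST
    return utility

# Evaluated at time 0, before knowing the outcome, the two
# predictor-consistent options are:
commit_to_pay = outcome(predicted_to_pay=True, actually_pays=True)    # 900
never_pay = outcome(predicted_to_pay=False, actually_pays=False)      # 0
assert commit_to_pay > never_pay  # 0-me wants the commitment

# Evaluated at time 1, conditional on already standing at the ATM
# (so the rescue is a sunk benefit), not paying looks strictly better:
pay_at_atm = outcome(predicted_to_pay=True, actually_pays=True)       # 900
skip_at_atm = outcome(predicted_to_pay=True, actually_pays=False)     # 1000
assert skip_at_atm > pay_at_atm  # 1-me's incentive to defect
```

This is the sense in which 1-me has "different incentives" without obviously having different values: the same utility function, evaluated from a better-informed position, ranks the actions differently.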
Thanks for clarifying!