This post attempts to generalize and articulate a problem that people have been thinking about since at least 2016.
I found some related discussions going back to 2009. It’s mostly highly confused, as you might expect, but I did notice this part which I’d forgotten and may actually be relevant:
But if you are TDT, you can’t always use less computing power, because that might be correlated with your opponents also deciding to use less computing power
This could potentially be a way out of the “racing to think as little as possible before making commitments” dynamic, but if we have to decide how much to let our AIs think initially before making commitments, on the basis of reasoning like this, that’s a really hairy thing to have to do. (This seems like another good reason for wanting to go with a metaphilosophical approach to AI safety instead of a decision theoretic one. What’s the point of having a superintelligent AI if we can’t let it figure these kinds of things out for us?)
If two consequentialists are attempting to divide up a pie or select a game-theoretic equilibrium to play in, the one that can “move first” can get much more than the one that “moves second.”
I’m not sure how the folk theorem shows this. Can you explain?
going updateless is like making a bunch of commitments all at once
Might be a good idea to offer some examples here to help explain updatelessness and pump intuitions.
Meanwhile, a few years ago when I first learned about the concept of updatelessness, I resolved to be updateless from that point onwards. I am now glad that I couldn’t actually commit to anything then.
Interested to hear more details about this. What would have happened if you were actually able to become updateless?
Would trying to become less confused about commitment races before building a superintelligent AI count as a metaphilosophical approach or a decision theoretic one (or neither)? I’m not sure I understand the dividing line between the two.
Trying to become less confused about commitment races can be part of either a metaphilosophical approach or a decision theoretic one, depending on what you plan to do afterwards. If you plan to use that understanding to directly give the AI a better decision theory which allows it to correctly handle commitment races, then that’s what I’d call a “decision theoretic approach”. Alternatively, you could try to observe and understand what humans are doing when we’re trying to become less confused about commitment races and program or teach an AI to do the same thing, so it can solve the problem of commitment races on its own. This would be an example of what I call a “metaphilosophical approach”.
I didn’t mean to suggest that the folk theorem proves anything. Nevertheless, here is the intuition: the way the folk theorem shows that any status quo is possible is by assuming that players start off expecting everyone else to grim-trigger them for violating that status quo. So in a two-player game, if both players start off assuming player 1 will grim-trigger player 2 for violating player 1’s preferred status quo, then player 1 will get what they want. One way to get this to happen is for player 1 to be “earlier in logical time” than player 2 and make a credible commitment.
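To make the grim-trigger intuition a bit more concrete, here is a minimal toy sketch (the numbers and payoffs are my own, not anything from the original discussion) of a repeated pie-splitting game: if player 2 starts off believing that player 1 will grim-trigger any deviation from player 1’s preferred split, then complying with that split is player 2’s best response.

```python
# Toy illustration (my own sketch, not from the post): why a credible
# grim-trigger threat lets player 1 lock in its preferred status quo.
#
# Repeated pie-splitting game with a pie of size 10 per round.
# Player 1's preferred status quo gives player 2 only 1 per round.
# If player 2 ever deviates (grabbing 5, the even split, for one round),
# player 1 grim-triggers and both get 0 in every later round.

def discounted_sum(payoff_per_round: float, delta: float, rounds: int) -> float:
    """Total discounted payoff for a constant per-round payoff."""
    return sum(payoff_per_round * delta**t for t in range(rounds))

def player2_payoffs(delta: float = 0.9, rounds: int = 200):
    # Comply forever: 1 per round, discounted.
    comply = discounted_sum(1.0, delta, rounds)
    # Deviate immediately: grab 5 once, then 0 forever under grim trigger.
    deviate = 5.0
    return comply, deviate

if __name__ == "__main__":
    comply, deviate = player2_payoffs()
    print(f"comply with status quo: {comply:.2f}")   # ~10.0
    print(f"deviate once:           {deviate:.2f}")  # 5.0
    # If player 2 starts off believing the grim-trigger threat, complying
    # with player 1's preferred status quo is its best response -- which is
    # why being "earlier in logical time" and committing first is valuable.
```

Running the same comparison with a low enough discount factor (e.g. delta = 0.3, where complying is only worth about 1.43) flips the result, which is one way to see that the threat only does its work when the future matters enough to player 2.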
As for updatelessness: Well, updateless agents follow the policy that is optimal from the perspective of the credences they have at the time they go updateless. So e.g. if there is a cowardly agent who simulates you at that time or later and then caves to your demands (if you make any), then an updateless agent will be a bully and make demands, i.e. commit to punishing people it identifies as cowards who don’t do what it wants. But of course updateless agents are also cowards themselves, in the sense that the best policy from the perspective of credences C is to cave in to any demands that have already been committed to according to C. I don’t have a super clear example of how this might lead to disaster, but I intend to work one out in the future...
Same goes for my own experience. I don’t have a clear example in mind of something bad that would have happened to me if I had actually self-modified, but I get a nervous feeling about it.
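To illustrate the bully-and-coward point a bit more concretely, here is a minimal toy model (the opponent types, payoffs, and prior are my own assumptions, not anything from the comment): an updateless agent picks a single policy, optimal under its prior credences, covering both an opponent who caves to demands and an opponent who has already committed, and the expected-utility-maximizing policy comes out as exactly “demand from the coward, cave to the prior commitment.”

```python
# Toy illustration (my own construction, not from the comment): an
# "updateless" agent picks the single policy that maximizes expected
# utility under its prior credences, before seeing which opponent it faces.
#
# Prior: 50% a "coward" who simulates you and caves to any demand you make;
# 50% a "bully" who has already committed to demanding most of the pie and
# destroying it if you resist.

from itertools import product

PRIOR = {"coward": 0.5, "bully": 0.5}

def payoff(opponent: str, demand_vs_coward: bool, cave_to_bully: bool) -> float:
    if opponent == "coward":
        # Demanding extracts 9 of 10 from a coward; otherwise split evenly.
        return 9.0 if demand_vs_coward else 5.0
    else:  # bully with a prior commitment
        # Caving leaves you 1; resisting destroys the pie.
        return 1.0 if cave_to_bully else 0.0

def best_updateless_policy():
    # A policy fixes behavior against both opponent types in advance.
    policies = list(product([True, False], repeat=2))
    def expected_utility(policy):
        demand, cave = policy
        return sum(p * payoff(o, demand, cave) for o, p in PRIOR.items())
    return max(policies, key=expected_utility)

if __name__ == "__main__":
    demand, cave = best_updateless_policy()
    print(f"demand against cowards: {demand}")    # True  -> the bully behavior
    print(f"cave to prior commitments: {cave}")   # True  -> the coward behavior
```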
Thanks, edited to fix!
I agree with your push towards metaphilosophy.