Well, this line of research started out as part of FAI. We need to figure out how to encode human values into AI and make it stable. So it doesn’t make much sense to study a decision theory that will immediately self-modify into something else if it anticipates counterfactual muggings in the future. It’s better to study whatever it will self-modify into. Same as with game theory—even if you want to study non-equilibrium behavior, figuring out equilibrium first is a better starting point.
Well, that’s not the only reason to look into this problem. What about it being a good practice problem and an opportunity to learn to think more clearly? I mean, you can never know in advance what you’ll discover when you start investigating a problem.
Yeah, agreed. I think if you look deeper into Counterfactual Mugging and when a copy should stop caring about other copies, you’ll eventually arrive at the logical updatelessness problem which is currently unsolved.
Here’s the simplest statement of the problem. Imagine you’re charged with building a successor AI, which will be faced with Counterfactual Mugging with a logical coin. The resulting money will go to you. You know the definition of the coin already (say, the parity of the trillionth digit of pi) but you don’t have enough time to compute it. However, the successor will have enough time to compute the coin. So from your perspective, if your beliefs about the coin are 50:50, you want to build a successor that will pay up if asked—even if the coin’s value will be as obvious to the successor as 2+2=4. So it seems like a strong agent built by a weak agent should inherit its “logical prior”, acting as if some sentences still have nonzero “logical probability” even after they are proven to be false. Nobody knows how such a prior could work, we’ve made several attempts but they pretty much all failed.
Thanks for that explanation. I’ve been reading a lot about various paradoxes recently and I’ve been meaning to get into the issue of logical counterfactuals, but I’m still reading about decision theory more generally, so I haven’t had a chance to dig into it yet. But trying to figure out exactly what is going on in problems like Counterfactual Mugging seems like a reasonable entry point. But I was having trouble understanding what was going on even in the basic Counterfactual Mugging, so I decided to take a step back and try to gain a solid understanding of imperfect predictor Parfit’s Hitchhiker first. But even that was too hard, so I came up with the Evil Genie problem.
Do you now understand UDT’s reasoning in Evil Genie, imperfect Parfit’s Hitchhiker and Counterfactual Mugging? (Without going into the question whether it’s justified.)
Mostly. So you optimise over all agents who experience a particular observation in at least one observation-action map?
Hmm, it seems like this would potentially create issues with irrelevant considerations. As I wrote in my other comment:
“We can imagine adding an irrelevant decision Z that expands the reference class to cover all individuals as follows. Firstly, if it is predicted that you will take option Z, everyone’s minds are temporarily overwritten so that they are effectively clones of you facing the problem under discussion. Secondly, option Z causes everyone who chooses it to lose a large amount of utility, so no-one should ever take it. But according to this criterion, it would expand the reference class used even when comparing choice C to choice D. It doesn’t seem that a decision that is not taken should be able to do this.”
Extra Information:
So, to spec this out more concretely:
Assume that there are 100 clones and you’ve just observed that you are in a red room. If Omega predicts that the person in room 1 will choose option A or B, then only room 1 will be red; otherwise, all rooms will be red. The utilities depending on the choice of the person in room 1 are as follows:
Option A: Provides the person in room 1 with 100 utility and everyone else with −10 utility
Option B: Provides the person in room 1 with 50 utility and everyone else with 0 (I originally accidentally wrote room 2 in this line).
Option C: Provides everyone in all rooms with −1000 utility
If given only Option A or Option B, then you should choose option A. But once we add in option C, you could be any of the clones in that observation-action map, so it seems like you ought to prefer option B to option A.
It seems like there are two ways to read your problem. Are there “knock-on effects”—if the person in room 1 chooses A, does that make everyone else lose 10 utility on top of anything else they might do? Or does each choice affect only the person who made it?
If there are no knock-on effects, UDT says you should choose A if you’re in a red room or B otherwise. If there are knock-on effects, UDT says you should choose B regardless of room color. In both cases it doesn’t matter if C is available. I think you meant the former case, so I’ll explain the analysis for it.
An “observation-action map” is a map from observations to actions. Here are some examples of observation-action maps:
1) Always choose B
2) Choose A if you’re in a red room, otherwise B
3) Choose C if you’re in a red room, otherwise A
And so on. There are 9 possible maps if C is available, or 4 if C is not available.
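As a sanity check on those counts (a sketch of my own, not from the thread itself): each of the two observations (red room or non-red room) is independently assigned one action, so there are |actions|^|observations| maps in total.

```python
from itertools import product

def observation_action_maps(observations, actions):
    """Enumerate every map from observations to actions."""
    return [dict(zip(observations, combo))
            for combo in product(actions, repeat=len(observations))]

observations = ["red", "non-red"]
with_c = observation_action_maps(observations, ["A", "B", "C"])
without_c = observation_action_maps(observations, ["A", "B"])
print(len(with_c))     # 3^2 = 9 maps when C is available
print(len(without_c))  # 2^2 = 4 maps when it is not
```

Map 2 from the list above is just the entry `{"red": "A", "non-red": "B"}` in this enumeration.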
For each map, UDT imagines all instances of itself acting according to that map, and calculates the aggregate utility of all people it cares about in the resulting timeline. (Utilities of people can be aggregated by summing or averaging, which corresponds to UDT+SIA vs UDT+SSA, but in this problem they are equivalent because the number of people is fixed at 100, so I’ll just use summing.) By that criterion it chooses the map that leads to highest utility, and acts according to it. Let’s look at the three example maps above and figure out the resulting utilities, assuming no knock-on effects:
1) Everyone chooses B, so the person in room 2 gets 50 utility and everyone else gets 0. Total utility is 50.
2) The person in room 1 chooses A, so Omega paints the remaining rooms non-red, making everyone else choose B. That means one person gets 100 utility, one person gets 50 utility, and everyone else gets 0. Total utility is 150.
3) The person in room 1 chooses C, so Omega paints the remaining rooms red and everyone else chooses C as well. Total utility is −100000.
And so on for all other possible maps. In the end, map 2 leads to highest utility, regardless of whether option C is available.
Does that make sense?
Oh, I just realised that my description was terrible. Firstly, I said option B added 50 utility to the person in room 2, instead of room 1. Secondly, only the decision of the person in room 1 matters, and it determines the utility everyone else gains.
Any map where the person in room 1 selects A produces −890 utility in total (100 − 99×10), while any map where the person in room 1 selects B produces 50 in total.
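Those totals are easy to verify with a small sketch (my own illustration, using the corrected payoffs: only the room-1 decision matters and it fixes everyone’s utility):

```python
def total_utility(choice, n_people=100):
    """Total utility across all 100 people when the person in room 1
    makes `choice`; nobody else's choice affects the payoffs."""
    if choice == "A":
        return 100 + (n_people - 1) * (-10)  # decider +100, the other 99 get -10 each
    if choice == "B":
        return 50 + (n_people - 1) * 0       # decider +50, others unaffected
    if choice == "C":
        return n_people * (-1000)            # everyone gets -1000
    raise ValueError(choice)

print(total_utility("A"))  # -890
print(total_utility("B"))  # 50
print(total_utility("C"))  # -100000
```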
Yet suppose we compared only A and B, and option C didn’t exist. Then you always know that you are the original, as you are in the red room and none of the clones are. The only issue I can see with this reasoning is if UDT insists that you care about all the clones, even when you know that you are the original before you’ve made your decision.
Okay, then UDT recommends selecting B regardless of whether C is available. The decision is “self-sacrificing” from the decider’s point of view, but that’s fine. Here’s a simpler example: we make two copies of you, then one of them is asked to pay ten dollars so the other can get a million. Of course agreeing to pay is the right choice! That’s how you’d precommit before the copies got created. And if FAI is faced with that kind of choice, I’m damn sure I want it to pay up.
So you care about all clones, even if they have different experiences/you could never have been them? I only thought you meant that UDT assumed that you cared about copies who “could be you”.
It seems like we could make them semi-clones who use the same decision-making algorithm as you, but who have a completely different set of memories, so long as that doesn’t affect their process for making the decision. In fact, we could ensure that they have different memories from the first moment of their existence. Why should you care about these semi-clones as much as about yourself?
However, if you choose option C, then in addition to being put in a red room, their memories are replaced so that they have the same memories as you.
But suppose option C doesn’t exist, would UDT still endorse choosing option B?
At this point the theory starts to become agnostic, it can take an arbitrary “measure of caring” and give you the best decision according to that. If you’re a UDT agent before the experiment and you care equally about all future clones no matter how much they are tampered with, you choose B. On the other hand, if you have zero caring for clones who were tampered with, you choose A. The cutoff point depends on how much you care for inexact clones. The presence or absence of C still doesn’t matter. Does that make sense?
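One way to make that cutoff concrete (my own sketch, under the assumption that the “measure of caring” is a single linear weight w on the 99 tampered semi-clones) is to compare the weighted values of A and B directly; the indifference point lands at w = 50/990 ≈ 0.05.

```python
def weighted_value(choice, w, n_clones=99):
    """Decider's utility with caring weight w (0 = don't care at all,
    1 = care equally) applied to the tampered semi-clones."""
    if choice == "A":
        return 100 + w * n_clones * (-10)  # own +100, weighted -10 per clone
    if choice == "B":
        return 50.0                        # own +50, clones unaffected
    raise ValueError(choice)

print(weighted_value("A", 0.0))  # 100.0: with zero caring, A beats B
print(weighted_value("A", 1.0))  # -890.0: with full caring, B beats A
# The tie sits where 100 - 990*w == 50, i.e. w = 50/990.
```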
Let’s see if I’ve got it. So you aggregate using the total or average on a per-decision basis (or, in UDT 1.1, per observation-action mapping), meaning that individuals who count for one decision may not count in another, even if they still exist?
Sorry, I’ve tried looking at the formalisation of UDT, but it isn’t particularly easy to follow. It just assumes that you have a utility function that maps from the execution histories of a set of programs to a real number. It doesn’t say anything about how this should be calculated.
Hmm, not sure I understand the question. Can you make an example problem where “individuals who count for one decision may not count in another”?
I don’t think it is possible to construct anything simpler, but I can explain in more detail.
Suppose you only care about perfect clones. If you select decision C, then Omega has made your semi-clones actual clones, so you should aggregate over all individuals. However, if you select A, they are still only semi-clones so you aggregate over just yourself.
Is that a correct UDT analysis?
Hmm, now the problem seems equivalent to this:
A) Get 100 utility
B) Get 50 utility
C) Create many clones and give each −1000 utility
If you’re indifferent to mere existence of clones otherwise, you should choose A. Seems trivial, no?
Sure. Then the answer to my question: “individuals who count for one decision may not count in another even if they still exist?” is yes. Agreed?
(Specifically, the semi-clones still exist in A and B; they just haven’t had their memories swapped out in such a way that they would count as clones.)
If you agree, then there isn’t an issue. This test case was designed to create an issue for theories that insist that this ought not to occur.
Why is this important? Because it allows us to create no-win scenarios. Suppose we go back to the genie problem where pressing the button creates semi-clones. If you wish to be pelted by eggs, you know that you are the original, so you regret not wishing for the perfect life. The objection you made before doesn’t hold as you don’t care about the semi-clones. But if you wish for the perfect life, you then know that you are overwhelmingly likely to be a semi-clone, so you regret that decision too.
Yeah, now I see what kind of weirdness you’re trying to point out, and it seems to me that you can recreate it without any clones or predictions or even amnesia. Just choose ten selfish people and put them to sleep. Select one at random, wake him up and ask him to choose between two buttons to press. If he presses button 1, give him a mild electric shock, then the experiment ends and everyone wakes up and goes home. But if he presses button 2, give him a candy bar, wake up the rest of the participants in separate rooms and offer the same choice to each, except this time button 1 leads to nothing and button 2 leads to shock. The setup is known in advance to all participants, and let’s assume that getting shocked is as unpleasant as the candy bar is pleasant.
In this problem UDT says you should press button 1. Yeah, you’d feel kinda regretful having to do that, knowing that it makes you the only person to be offered the choice. You could just press button 2, get a nice candy bar instead of a nasty shock, and screw everyone else! But I still feel that UDT is more likely to be right than some other decision theory telling you to press button 2, given what that leads to.
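The totals behind that recommendation can be sketched as follows (my own illustration, scoring candy as +1 and shock as −1, since the problem stipulates they cancel), comparing the two uniform strategies over all ten participants:

```python
def strategy_total(strategy, n=10):
    """Summed utility over all n participants when every woken person
    follows `strategy`. Candy = +1, shock = -1."""
    if strategy == "press 1":
        # The first person woken is shocked and the experiment ends.
        return -1
    if strategy == "press 2":
        # The first person gets candy; the other nine are then woken,
        # also press 2, and each gets shocked.
        return 1 + (n - 1) * (-1)
    raise ValueError(strategy)

print(strategy_total("press 1"))  # -1
print(strategy_total("press 2"))  # -8, which is why UDT says press 1
```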
Perhaps it is, but I think it is worth spending some time investigating this and identifying the advantages and disadvantages of different resolutions.
Hmm, I’m not sure that it works. If you press 1, it doesn’t mean that you’re the first person woken. You need them to be something like semi-clones for that. And the memory trick is only for the Irrelevant Considerations argument. That leaves the prediction element which isn’t strictly necessary, but allows the other agents in your reference class (if you choose perfect life) to exist at the same time the original is making its decision, which makes this result even more surprising.
Agree about the semi-clones part. This is similar to the Prisoner’s Dilemma: if you know that everyone else cooperates (presses button 1), you’re better off defecting (pressing button 2). Usually I prefer to talk about problems where everyone has the same preference over outcomes, because in such problems UDT is a Nash equilibrium. Whereas in problems where people have selfish preferences but cooperate due to symmetry, like this problem or the symmetric Prisoner’s Dilemma, UDT still kinda works but stops being a Nash equilibrium. That’s what I was trying to point out in this post.