Decision Theory with F@#!ed-Up Reference Classes
Before we can answer the question of what you ought to do, we need to identify exactly which agents are referred to by "you". In some problems, "you" refers to a single, easily identifiable agent whose actions produce deterministic results, but in other problems many agents will experience positions that are completely indistinguishable. Even then, we can normally identify a fixed set of agents who are possibly you and average over them. However, there exists a set of problems where this set of indistinguishable agents depends on the decision that you make, at which point it becomes rather unclear who exactly you are trying to optimise over. We will say that these problems have Decision-Inconsistent Reference Classes.
While this may seem like merely a niche issue, given the butterfly effect and a sufficiently long timeline with the possibility of simulations, it is almost guaranteed that any decision will change the reference class. So understanding how to resolve these issues is more important than it might first appear. More importantly, if I am correct, Imperfect Parfit’s Hitchhiker doesn’t have a real answer and UDT would require some rather significant modifications.
(This post is based upon the material in this comment, which I said I was planning to develop into a full post. It contains some substantial corrections and additions.)
Motivation
My exploration of this area is mostly motivated by Imperfect Parfit’s Hitchhiker. Here we define this as Parfit’s Hitchhiker with a driver that always detects when you are telling the truth about paying, but 1% of the time picks you up independently of whether you are or aren’t being truthful. We’ll also imagine that those agents who arrive in town discover a week after their decision whether or not they were in the group who would have been picked up independent of their decision.
Solving this problem involves challenges that aren't present in the version with perfect predictors. After all, once we've defined a notion of counterfactuals for perfect predictors (harder than it looks!), it's clear that defecting against these predictors is a losing strategy. There is no (inside-view) downside to committing to taking a sub-optimal action given an input that ought to be impossible. However, as soon as the predictors have even an arbitrarily small amount of imperfection, choosing to pay actually means giving up something.
Given the natural human tendency towards fairness, it may be useful to recall the True Prisoner's Dilemma: what if, instead of rescuing one person, the driver rescued your entire family, and instead of demanding $50 he demanded that a random 50% of you be executed? In this new scenario, refusing to "pay" him for his services no longer seems quite so fair. And if you can get the better of him, why not do so? Or if this isn't sufficient, we can imagine the driver declaring that it's fair game to try fooling him into thinking that you'll pay.
Now that our goal is to beat the driver if that is at all possible, we can see that this is prima facie ambiguous, as it isn't clear which agents we wish to optimise over. If we ultimately defect, then only 1% of agents arrive in town, but if we ultimately pay, then 100% of agents arrive in town. Should we optimise over the 1% or the 100%? Consider the moment after you've locked in your decision, but before it's revealed whether you would have been picked up anyway (call this the immediate aftermath). Strangely, in the immediate aftermath you will reflectively endorse whatever decision you made. An agent who decided to defect knows that they were in the 1% and so they would have always ended up in town, while those who decided to pay will assign only a 1% probability that they were going to be picked up if they hadn't paid. In the latter case, the agent may later regret paying if they discover that they were indeed in the 1%, but this will only be later, not in the immediate aftermath.
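To make the ambiguity concrete, here is a minimal sketch in Python of the two candidate averages. The $50 payment is the amount demanded in the scenario above; the utility assigned to dying in the desert is purely an illustrative stand-in.

```python
# Imperfect Parfit's Hitchhiker: the driver always detects a truthful
# intention to pay, and picks you up anyway 1% of the time regardless.
PAY_COST = 50                 # the $50 demanded once you reach town
DESERT_UTILITY = -1_000_000   # illustrative stand-in for dying in the desert
LUCKY_RATE = 0.01             # chance of being picked up independent of your decision

def ex_ante_average(pays: bool) -> float:
    """Average over *all* agents facing the problem (the 100%)."""
    if pays:
        return -PAY_COST  # everyone reaches town and pays
    return LUCKY_RATE * 0 + (1 - LUCKY_RATE) * DESERT_UTILITY

def in_town_average(pays: bool) -> float:
    """Average over only those agents who actually arrive in town."""
    if pays:
        return -PAY_COST  # 100% of payers arrive, and all of them pay
    return 0              # only the lucky 1% arrive, and they keep their $50

for rule in (ex_ante_average, in_town_average):
    best = max((True, False), key=rule)
    print(rule.__name__, "recommends", "paying" if best else "defecting")
# ex_ante_average recommends paying; in_town_average recommends defecting.
```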
Some Example Problems
It’ll be easier to attempt solving this problem if we gather some other problems that mess with reference classes. One such problem is the Evil Genie Puzzle I defined in a previous post. This creates the exact opposite problem—you reflectively regret whichever decision you choose. If you choose the million dollars (I wrote perfect life instead in the post), you know in the immediate aftermath that you are almost certainly a clone, so you should expect to be tortured. However, if you choose the rotten eggs, you know in the immediate aftermath that you could have had a million dollars.
Since one potential way of evaluating situations with Decision-Inconsistent Reference Classes is to simply compare averages, we'll also define the Dilution Genie Puzzle. In this puzzle, a genie offers you $1,000,001 or $10. However, if the genie predicts that you will choose the greater amount, it creates 999,999 clones of you who will face what seems like an identical situation, but who will actually each only receive $1 when they inevitably choose the same option as you. This means that choosing the $1,000,001 really provides an average of $2, so choosing the $10 might actually be a better decision, though if you do take the $10, you could have won the million.
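As a quick check of that arithmetic, a two-line sketch in Python:

```python
# Dilution Genie: if you're predicted to take the large amount, 999,999 clones
# are created, each of whom only receives $1.
big_option = [1_000_001] + [1] * 999_999      # the original winner plus the clones
small_option = [10]                           # no clones are created

print(sum(big_option) / len(big_option))      # 2.0  -> average if you take $1,000,001
print(sum(small_option) / len(small_option))  # 10.0 -> average if you take $10
```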
Possible Approaches:
Suppose an agent G faces a decision D represented by input I. The most obvious approaches for evaluating this decision are as follows:
1) Individual Averages: If X is an option, calculate the expected utility of X by averaging over all agents who experience input I if G chooses X. Choose the option with the highest expected utility.
This approach defects on Imperfect Parfit's Hitchhiker, chooses the rotten eggs for Evil Genie and chooses the $10 for Dilution Genie. Problems like Perfect Parfit's Hitchhiker and Perfect Retro Blackmail are undefined, as the reference class is empty. We can't substitute an average utility of 0 for an empty reference class, because in Perfect Parfit's Hitchhiker this results in us dying in the desert. We also can't strike out these options and choose from those remaining, since in Retro Blackmail this would result in us crossing out the option of not paying. So a major flaw with this approach is that it doesn't handle problems where one decision invalidates the reference class.
It is also somewhat counterintuitive that individuals who count when evaluating one possible option may not count when evaluating another option for the same decision, even if they still exist.
2) Pairwise Averages: If X & Y are options, compare these pairwise by calculating the average utility over all agents who experience input I if G chooses either X or Y. Non-existence is treated as a 0.
This approach pays in Perfect or Imperfect Parfit’s Hitchhiker, chooses the rotten eggs for Evil Genie, $1,000,001 for Dilution Genie and refuses to pay in Retro Blackmail.
Unfortunately, this doesn’t necessarily provide a consistent ordering, as we’ll shortly see. The following diagram represents what I’ll call the Staircase Prediction Problem because of the shape of the underlined entries:
#: 1 2 3 4 5 6 7
A: 0 1 0 0 0 0 0
B: 0 0 2 0 0 0 0
C: 0 0 0 0 1 0 0
There are 7 agents (numbered 1-7) who are identical clones, and three different possible decisions (labelled A-C). None of the clones know which one they are. A perfect predictor predicts which option person 1 would pick if they were woken up in town; since the others are clones, they would also choose the same option if they were woken up in town.
The underlined entries indicate which people will be woken up in town if it is predicted that person 1 will make that option and the non-underlined entries indicate who will be woken up on a plain. For those who are in town, the numbers indicate how much utility each agent is rewarded with if they choose that option. For those who aren’t in town, the agent is instead rewarded (or not) based on what they would counterfactually do if they were in town.
Comparing the lines pair-wise to see what decision we should make in town, we find:
B beats A (2/6 vs. 1/6)
C beats B (1/6 vs. 0/6)
A beats C (1/6 vs. 0/6)
Note that to be included in the average, a person only needs to be woken in town in one of the two options.
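To verify the cycle, here is a sketch of the pairwise-average calculation in Python. The underlines in the diagram carry the in-town information, so the in-town sets below are hypothetical placeholders chosen only to be consistent with the averages quoted above:

```python
# Payoff table for the Staircase Prediction Problem
# (rows = options A-C, columns = persons 1-7).
payoffs = {
    "A": [0, 1, 0, 0, 0, 0, 0],
    "B": [0, 0, 2, 0, 0, 0, 0],
    "C": [0, 0, 0, 0, 1, 0, 0],
}

# Hypothetical in-town sets (the original diagram marks these with underlines);
# these particular sets are placeholders consistent with the stated averages.
in_town = {
    "A": {2, 3, 4, 6, 7},
    "B": {5, 6, 7},
    "C": {1, 2, 4},
}

def pairwise_average(x: str, y: str) -> float:
    """Average utility of option x over everyone in town under x or under y."""
    group = in_town[x] | in_town[y]
    return sum(payoffs[x][person - 1] for person in group) / len(group)

for x, y in [("B", "A"), ("C", "B"), ("A", "C")]:
    print(f"{x} vs {y}: {pairwise_average(x, y):.3f} vs {pairwise_average(y, x):.3f}")
# B vs A: 0.333 vs 0.167
# C vs B: 0.167 vs 0.000
# A vs C: 0.167 vs 0.000  -> a cycle, so there is no consistent ordering.
```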
Since this provides an inconsistent ordering, this approach must be flawed.
3) Overall Averages: If X is an option, calculate the expected utility of X by averaging over all agents for which there is at least one option Y where they experience input I when G chooses Y. Non-existence is treated as a 0.
This approach is the same as 2) in many problems: it pays in Perfect or Imperfect Parfit’s Hitchhiker, chooses the rotten eggs for Evil Genie, $1,000,001 for Dilution Genie and refuses to pay in Retro Blackmail.
However, we run into issues with irrelevant considerations changing our reference classes. We will call this situation the Catastrophe Button Scenario.
#: 1 2 3
A: 3 3 0
B: 4 4 −1007
C: -∞ -∞ -∞
Again, underlined represents being woken up in town and non-underlined represents being woken up on the plain. As before, who is woken up depends on the prediction of Person 1's decision, and agents who wake up in town don't know which person they are. C is the option representing pressing the Catastrophe Button. No-one wants to press this button, as it leads to an unimaginably bad outcome. Yet, using overall averages, the presence of C forces us to include Person 3 in our calculation of averages. Without person 3, A provides an average utility of 3 and B of 4. However, with person 3, A provides an average of 2 and B an average of −333. So the presence of the Catastrophe Button reverses the option we choose, despite it being a button that we will never press and hence clone 3 never being woken up in town. This seems absurd.
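Here is a minimal sketch of that calculation in Python, assuming (as the description implies) that persons 1 and 2 are the ones woken in town under A and B, and person 3 only under C:

```python
# Catastrophe Button: rows = options, columns = persons 1-3.
INF = float("inf")
payoffs = {
    "A": [3, 3, 0],
    "B": [4, 4, -1007],
    "C": [-INF, -INF, -INF],
}
# Assumed in-town sets: persons 1 and 2 for A and B; person 3 only for C.
in_town = {"A": {1, 2}, "B": {1, 2}, "C": {3}}

def overall_average(option: str, available_options) -> float:
    """Average over everyone who is in town under at least one available option."""
    group = set().union(*(in_town[o] for o in available_options))
    return sum(payoffs[option][p - 1] for p in group) / len(group)

print(overall_average("A", ["A", "B"]), overall_average("B", ["A", "B"]))            # 3.0 4.0
print(overall_average("A", ["A", "B", "C"]), overall_average("B", ["A", "B", "C"]))  # 2.0 -333.0
# Merely adding the never-pressed option C flips the choice from B to A.
```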
But I care about all of my clones
We actually don't need full clones in order to create these results. We can work with what I like to call semi-clones: agents that make exactly the same decisions in particular situations, but which have very different life stories and preferences. For example, we could take an agent and change the country it was brought up in, the flavours of ice-cream it likes and its general personality, whilst leaving its decision theory components exactly the same. Even if you necessarily care about your clones, there's much less reason for a selfish agent to care about its semi-clones. Or if that is insufficient, we can imagine that your semi-clones teamed up to murder your family. The only requirement for being a semi-clone is that they come to the same decision for a very restricted range of decision theory problems.
So if we make the individuals all different semi-clones, but keep them from knowing their identity, they should only care about the agents that were woken up in the town, as these are the only agents that are indistinguishable.
What about UDT?
UDT only insists on a utility function from the cross-product of execution histories of a set of programs to the real numbers, and doesn't define anything about how this function ought to behave. There is no restriction on whether it ought to be using the Self-Indication Assumption or the Self-Sampling Assumption for evaluating execution histories with varying numbers of agents. There is no requirement to care about semi-clones or not care about them.
The only real requirement of the formalism is to calculate an average utility over all possible execution histories, weighted by the probability of them occurring. So, for example, in Imperfect Parfit's Hitchhiker with a single agent, we can't just calculate an average for the cases where you happen to be in town; we also need to assign a utility for when you are left in the desert. But if we ran the problem with 100 hitchhikers, one of whom would always be picked up independently of their decision, we could define a utility function that only took into account those who actually arrived in town. But this utility function isn't just used to calculate the decision for one input; it is used to produce an input-output map for all possible inputs and outputs. It seems ludicrous that decisions only relevant to the desert should be calculated just for those who end up in town.
Where does this leave us? UDT could technically represent Proposal 1, but in addition to the issue with empty reference classes, this seems to be an abuse of the formalisation. Proposal 2 is incoherent. Proposal 3 is very natural for UDT, but leads to irrelevant considerations affecting our decision.
So UDT doesn't seem to tell us much about what we ought to do, nor provide a solution, and even if it did, the specific approach would need to be justified rather than merely assumed.
What if we rejected personal identity?
If we argued that you shouldn’t care any more about what are traditionally seen as your future observer moments than anyone else’s, none of the scenarios discussed above would pose an issue. You would simply care about the average or total utility of all future person moments independent of whose they might appear to be. Of course, this would be a radical shift for most people.
What if we said that there was no best decision?
All of the above theories choose the rotten eggs in the Evil Genie problem, but none of them seem to give an adequate answer to the complaint that the decision isn’t reflectively consistent. So it seems like a reasonable proposal to suggest that the notion of a “best decision” depends on there being a fixed reference class. This would mean that there would be no real answer to Imperfect Parfit’s Hitchhiker. It would also require there to be significant modifications to UDT, but this is currently the direction that I’m leaning.
Sorry, can you sort out your capital letters a bit? For example, A is both an agent and a choice, C is both a choice and a catastrophe button, and I have no idea what I is. So the main part of your post is really hard to parse for me.
I’ve edited it to resolve these issues. The agent is now G and the variables used for choices are X and Y. Hope that clears things up!
“Agents who experience I”—what’s I?
An input. I don’t know if that clarifies anything though?
You said O is an observation, and now I is an input. What’s the difference? I still don’t understand the whole setup.
I don't know why I changed that. Perhaps an observation might be taken to imply that the situation is real/possible, while an input is sometimes just a simulation/prediction. But what's unclear?
(Edit: I seem to have left one mention of O and the word “observation” in the post, but it is now fixed)
What are “underlined entries”? I don’t see any underlines.
Sorry, I think there was an issue with the mobile editor. Fixed now. I changed from bold to underlined because bold infinities don’t really show up.
Okay, now I understand your post a bit better. You’re right that UDT doesn’t say whose welfare you should maximize. You have to tell it who you care about. But I’m not sure why that’s a problem for imperfect Parfit’s Hitchhiker. Maximizing the average welfare of all hitchhikers leads to the usual UDT answer. Same for the Catastrophe Button scenario, if persons 1, 2 and 3 all exist anyway, the simple-minded UDT answer doesn’t depend on the presence of the button. The problems only begin when you make the reference class depend on observations, so don’t do that :-)
“Maximizing the average welfare of all hitchhikers leads to the usual UDT answer”—sure, but I’m trying to figure out if that’s what we should be maximising or whether we should be maximising the average welfare of those who actually arrive in town.
“The simple-minded UDT answer doesn’t depend on the presence of the button”—You seem to be assuming that if an agent is included in a problem, that they have to be included in the utility calculation. But why couldn’t we make one of the agents an ape and limit the utility function to only optimise over humans? Actually, I’m confused now. You said, “UDT doesn’t say whose welfare you should maximize”, then you seemed to imply that we should maximise over all hitchhikers. What are you using to justify that?
“The problems only begin when you make the reference class depend on observations, so don’t do that :-)”—Sure, but how do we avoid this? If you know what agent you are, you can just optimise for yourself. But otherwise, the set of people who could be you is dependent on those who make the same observations.
Reflective consistency. If you think you might face Parfit’s Hitchhiker in the future (perfect or imperfect, no matter) and you can self-modify, you’ll self-modify to use the UDT solution. Studying reflectively inconsistent theories seems less useful to me, because AIs will be able to modify their code.
I’m only confident about the Catastrophe Button scenario if persons 1, 2 and 3 are clones. If they are only semi-clones, I agree with you that the justification for UDT is weaker.
Usually for toy UDT problems we assume that observations made within the toy problem don’t “split” your caring. Another question is how to apply that to real life, where decision problems don’t have sharp boundaries. Most likely you should unwind your beliefs to some fixed past moment and “freeze” them into your decision-making. This is the same issue as priors in Bayesian reasoning, how far do you go back to determine your prior? Is your belief that you’re a human a prior or posterior belief? These are interesting questions but I think they are separate from the theory in itself.
And how does this reflective consistency work (I'm not so sure this step can be abstracted out)? Do you average over all agents who were in an identical situation in the past? Or over all agents who were in an equivalent situation (i.e. one wakes up in a red room, another wakes up in an identical room except it is green, etc.)? Or over all agents, even if they never had an experience remotely the same? And then there's the issue of Pascal's Mugging style problems that begin before the agent ever existed.
Doesn’t the fact that you’re uncertain about the answer seem to imply that UDT is under-defined?
Even if we’ve decided that observations don’t split caring, it isn’t clear which agents to include in the first place. And without knowing this, we can’t apply UDT.
How about this equivalence relation:
1) If two agents have utility functions that are perfectly cooperative with each other (not selfish), they are in the same reference class.
2) If two agents have a common ancestor, they are in the same reference class.
3) Take the transitive closure of (1) and (2).
For example, in Parfit’s Hitchhiker you’re in the same reference class as your ancestor before facing the problem, and in Catastrophe Button with perfect clones, all clones are in the same reference class.
Also, the reference class doesn't tell you how to average utilities; it just tells you whose decisions should be linked. There are many possible choices of averaging: for example, you could have Alice and Bob whose decisions are linked but who both care about Alice's welfare 10x more than Bob's. That kind of thing should also be an input to UDT, not something dictated by it.
Interesting. Are there any posts or papers where I can read more about these issues, as I suspect that this is an area that is under-explored and hence probably worthwhile looking into?
This would seem to imply that in an infinite universe everyone will be in the same reference class under certain assumptions. That is, given an agent that experiences state A1 then A2 and another that experiences B1 then B2, there is almost certainly going to be an agent that experiences state A1, then B1, then some other state C1.
I’m curious. Would you bite this bullet or do you think there’s an issue with how you formulated this?
When I said that the reference class tells you how to average utilities, I meant that it tells you how to average utilities for the purpose of accounting for uncertainty related to your identity.
Further, I think that it is reasonable to begin initial investigations by assuming perfect selfishness. Especially since it ought to be possible to transform any situation where utility functions include other people's welfare into one where you only care about your own welfare. For example, if Alice experiences a welfare of 1 and Bob experiences a welfare of 2, and both care about Alice ten times as much as Bob, we can reformulate this as Alice experiencing a welfare of 12 and Bob experiencing a welfare of 12, with both parties being perfectly selfish.
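A quick sketch of that reformulation in Python, using the numbers above:

```python
alice_welfare, bob_welfare = 1, 2
weight_on_alice, weight_on_bob = 10, 1  # both agents care about Alice ten times as much as Bob

# Each agent's caring-weighted utility, which we can then treat as a purely selfish welfare.
alice_total = weight_on_alice * alice_welfare + weight_on_bob * bob_welfare
bob_total = weight_on_alice * alice_welfare + weight_on_bob * bob_welfare
print(alice_total, bob_total)  # 12 12
```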
How so? You and me weren’t cloned from the same person, and that fact doesn’t change even if the universe is infinite.
It can't tell you that. Imagine that some event E happens with probability 1/4 and leads to 8 extra clones of you getting created. If E doesn't happen, no clones are created. Both the original and all clones wake up in identical rooms and don't know if E happened. Then each participant is asked if they'd prefer to get a dollar if E happened, or get a dollar if E didn't happen. The reference class includes all participants, but the decision depends on how their utilities are aggregated: SIA says you should choose the former, SSA says you should choose the latter. That's why measures of caring, like SIA or SSA, aren't the same thing as reference classes and must be given to UDT separately.
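For reference, a small sketch in Python of the credences each assumption assigns in this setup (the 1/4 probability and the 8 extra clones are taken from the example above):

```python
p_E = 0.25                         # prior probability that event E happens
observers = {True: 9, False: 1}    # 9 awakenings if E happens, 1 if it doesn't

# SSA (with the reference class of all participants): every participant makes the
# same observation either way, so the credence in E stays at the prior.
ssa_credence_E = p_E

# SIA: weight each possibility by how many observers it contains.
sia_credence_E = (p_E * observers[True]) / (
    p_E * observers[True] + (1 - p_E) * observers[False]
)

print(ssa_credence_E)  # 0.25
print(sia_credence_E)  # 0.75
```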
Use → to mean "then". My point was that if person A experiences A1 → A2, then A is in the same reference class as the agent C who experiences A1 → B1 → C1, according to your common ancestor rule. Similarly, agent B who experiences B1 → B2 is in the same reference class as C for the same reason. By transitivity, A and B end up in the same reference class despite lacking a common "ancestor". The infinite universe is what makes it overwhelmingly likely that such an agent C exists.
Perhaps I wasn’t clear enough in defining what is going on. Each state (A1, A2, B1...) defines not only the current stimuli that an agent is experiencing, but also includes their memory of the past as well (agents can forget things). The point is that these states should be indistinguishable to the agents experiencing them. Is that clear?
I don't think that's correct. SIA and SSA should give the same answer for all bets (see If a Tree Falls on Sleeping Beauty). SIA adjusts for the number of agents who win in the probability, while SSA instead adjusts for this in the decision theory.
Edit: Actually, you are right in that the reference class includes all participants. Before I thought that SIA and SSA would have different reference classes.
I’m only trying to describe how UDT deals with certain bounded problems where we can tell which agents have which ancestors etc. Not trying to solve an infinite universe where all possible problems happen.
That seems limiting. We don’t need to believe in an infinite universe, just believe in a non-zero, non-infinitesimal chance of this. Or going even broader, Agent C doesn’t have to exist, but only needs a non-zero, non-infinitesimal chance in our prior. Why doesn’t this affect your use case?
I don’t know, maybe you’re right. Can you describe a bounded problem where the idea of common ancestors goes wrong?
Oh, I had a typo in my last comment. I wrote: “non-zero, infinitesimal chance” instead of “non-zero, non-infinitesimal chance”.
I wasn’t claiming that it goes wrong for a bounded problem (although it might). This is hard for me to answer as I don’t know what you mean by “bounded problem”. I’ll admit that this isn’t an issue for typical example problems as you are just told a scenario.
But when you are just given a bunch of observations as per real life and have to derive the situation yourself, you will have small amounts of probability on all kinds of weird and wacky theories. And the transitive closure means that an absolutely tiny probability of one scenario can dramatically change the reference class.
Not sure I understand. When simplifying a real life situation into UDT terms, why would you go from “Alice and Bob have a tiny chance of having a common ancestor” to “Alice and Bob are definitely part of the same reference class”?
Because UDT cares about possibilities that don't occur, no matter how small the odds. I.e. Counterfactual Mugging with a 1 in a million chance of the coin coming up heads is still included in the possibility tree. Similarly, A and B are part of the same reference class if C is included in the possibility tree, no matter how small the odds. Not sure if that is clear though.
When I said “If two agents have a common ancestor, they are in the same reference class” I meant that they must have a common ancestor for certain, not just with nonzero probability. That’s also discontinuous (what if two agents have 99% probability of having a common ancestor?) but it works fine on many problems that are simplified from the real world, because simplification often involves certainty.
Do you mean that they must have a common ancestor if they exist or that they must have a common ancestor full stop?
Common ancestor full stop sounds more right.
Well, I suspect that the discontinuity will lead to strange results in practice, such as when you are uncertain of the past. For example, in Counterfactual Mugging, if there is any chance that you are a Boltzmann Brain who was created knowing/believing the coin was tails, then you won't have a common ancestor with brains that actually experienced the problem and saw tails, so you shouldn't pay. But perhaps you don't actually have to have an experience, but only believe that you had such an experience in the past?
All theories have limits of applicability. For example, Von Neumann-Morgenstern expected utility maximization requires the axiom of independence, which means you can’t be absent-minded (forgetting something and ending up in a previous mental state, like in the Absent-Minded Driver problem). If there’s even a tiny chance that you’re absent-minded, the problem can no longer be cast in VNM terms. That’s where UDT comes in, it can deal with absent-mindedness and many other things. But if there’s even a tiny chance of having more than one reference class, the problem can no longer be cast in UDT terms either. With multiple reference classes you need game theory, not decision theory.
I suppose the difference is that VNM states the limits within it operates, while I haven’t seen the limits of UDT described anywhere apart from this conversation.
I think you accidentally words.
“will change the reference class”