Consider Parfit’s Hitchhiker from the perspective of a completely selfish agent:
“Suppose you’re out in the desert, running out of water, and soon to die—when someone in a motor vehicle drives up next to you. Furthermore, the driver of the motor vehicle is a perfectly selfish ideal game-theoretic agent, and even further, so are you; and what’s more, the driver is Paul Ekman, who’s really, really good at reading facial microexpressions. The driver says, “Well, I’ll convey you to town if it’s in my interest to do so—so will you give me $100 from an ATM when we reach town?”
When you’re in the desert, the deal is to your benefit, but once you’re in town, you’re incentivised to defect. So you should expect the driver to not believe you and leave you to die in the desert.
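The reasoning above can be sketched as a small backward-induction calculation. This is only an illustration under stated assumptions: apart from the $100 fare, the payoff numbers are made up, the function names are hypothetical, and the driver is assumed to predict your in-town choice perfectly.

```python
# Illustrative payoffs; only the $100 fare comes from the scenario.
VALUE_OF_LIFE = 1_000_000  # assumed stand-in for "not dying in the desert"
FARE = 100

def cdt_choice_in_town():
    """Once in town the rescue is already secured, so only the fare differs."""
    payoffs = {"pay": VALUE_OF_LIFE - FARE, "don't pay": VALUE_OF_LIFE}
    return max(payoffs, key=payoffs.get)

def driver_gives_lift(predicted_choice):
    """A selfish driver (assumed to predict perfectly) helps only if he will be paid."""
    return predicted_choice == "pay"

def outcome(choice_in_town):
    """Payoff to the hitchhiker, given what they would do on reaching town."""
    if not driver_gives_lift(choice_in_town):
        return 0  # left in the desert
    return VALUE_OF_LIFE - FARE if choice_in_town == "pay" else VALUE_OF_LIFE

choice = cdt_choice_in_town()
print(choice, outcome(choice))  # don't pay 0
print("pay", outcome("pay"))    # pay 999900
```

The agent who would pay ends up far better off than the agent who picks the locally best option in town, which is the whole difficulty.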
An agent using Updateless Decision Theory will avoid an untimely death, but I believe that someone should be able to survive without advanced decision theories (i.e. using only Causal Decision Theory). Indeed, if you can successfully pre-commit to paying the driver, then you will survive, but this raises the question of whether you can.
But first, I note that “pre-commit” has two meanings: in a broad sense, irrevocably deciding to follow a particular course of action; and in a narrow sense, doing the same, but in a publicly verifiable manner. Since we are assuming that the driver has a high ability to guess your decision, these two definitions collapse into one within this scenario.
Now, we can imagine a host of situations in which you could pre-commit: you could be trusted to pay if a court would fine you $150 for not upholding your bargain, if you were a deontologist who thought that being moral was more important than a good outcome, or if God would let you into heaven if you are good. What if we exclude outside rewards and punishments and assume that you are completely self-interested and rational? Is pre-commitment possible in such circumstances?
The definition of “completely rational” is important here. If it means that the agent must make every decision rationally (according to CDT) and is forbidden to self-modify in a way that loses this property, then it necessarily follows that the agent will defect. On the other hand, if it means only that the agent always knows what the “rational” decision is, even after self-modification, and always chooses this decision before self-modification, then there is hope.
Indeed, a self-modifying AI that can rewrite its own code will find pre-committing trivial, but this is a poor model for actual humans. The exact extent to which humans can pre-commit is a complex question, but at a high level it is a mistake to pretend that we have either no ability to self-modify or an absolute ability to do so. Instead, we are somewhere in between, as we will now see.
Let’s suppose a selfish person forms the intention to pay, the driver believes them, and they have just been dropped off in town. They will still have a strong desire to hold onto their money and will feel a strong aversion to handing it over. They can imagine all of the options in the situation (just two: pay or don’t pay), iterate over them, and see that “don’t pay” has higher utility. It feels like they have a free choice of whether or not to pay.
So it certainly doesn’t feel to this person that they are pre-committed in any way. However, in a deterministic universe, an agent can strictly only ever make one choice, so they are always pre-committed to whatever choice they eventually end up making. So we don’t know that the agent is not pre-committed to paying; it merely appears as though the agent isn’t, and we can’t know until the decision is locked in. You might feel that you could choose “don’t pay”, and maybe you can; or maybe it is only the counterfactual you, who is technically not you, who can choose that.
Of course, when you were pre-committing, you should have predicted in advance that all of this was going to happen. If you end up in this situation and you don’t know how to respond to the feeling that you “should” decide not to pay, you’ve done a terrible job of pre-committing. You ought to have either spent more time trying to pre-commit, or perhaps given up and concluded that it was impossible. However, since pre-committing seems like a crucial ability for co-ordinating with others, I would suggest that it is worth spending a large amount of effort developing it. I’m not going to suggest that humans have an unlimited ability to do this; it’s easy to imagine sufficiently horrible outcomes where we wouldn’t be able to force ourselves to go through with the bargain. However, at the very least, we should be able to force ourselves to pay trivial costs in order to gain massive benefits.
So to what extent can we self-modify? There’s a lot that we can’t change. A completely selfish agent can recognise that $100 is a small price to pay for having their life saved, but they will still feel a strong desire to hold onto that money anyway. They will still know that a “rational” (CDT) agent would choose not to pay; specifically that if they looped over the available options and found the one with highest payoff, it would be “don’t pay”. Further, they can take it to the meta level and realise that the “rational” choice is to modify themselves to an agent using standard decision theory (CDT) if they aren’t one already. Given all of this, how is paying not a mistake? Undoubtedly, people could convince themselves to pay, but surely they’ve simply made an error in their reasoning somewhere?
Part of the confusion comes from a resolute agent having two sets of goals: its intrinsic goals, in this case purely selfish; and its chosen goals, which initially match its intrinsic goals, but which change after it decides to pre-commit. Is it irrational to pursue its chosen goals when it realises that these diverge from its intrinsic goals?
We split this into two questions: firstly, the question of adopting these goals; and secondly, the question of maintaining them. With regard to the first, the decision is rational so long as the agent made it in a sensible manner. With regard to the second, the chosen goals will appear irrational from the perspective of a standard rational agent. However, it would be a mistake for a resolute agent to conclude this, as they ought to be analysing the situation from the standpoint of their new (chosen) goals instead of their old (intrinsic) goals. It is those who refuse to pay who make a mistake in reasoning, not those who pay. Assuming they didn’t mess up the pre-commitment, their new, chosen goals should be self-reaffirming. For example, if an agent has the chosen goal of maximising its intrinsic goals without breaking any commitments, then, when given the choice, it will choose to maintain that goal rather than switch to standard decision theory (CDT).
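One way to picture “self-reaffirming” chosen goals is the toy model below. It is a rough sketch rather than a worked-out theory: the names (`intrinsic_choice`, `chosen_choice`, `pick_policy`), the payoff for survival, and the idea of representing a commitment as a single promised action are all assumptions made for illustration.

```python
VALUE_OF_LIFE = 1_000_000  # assumed; only the $100 fare is from the scenario
FARE = 100
OPTIONS = ["pay", "don't pay"]
INTRINSIC_PAYOFF = {"pay": VALUE_OF_LIFE - FARE, "don't pay": VALUE_OF_LIFE}
PROMISED_ACTION = "pay"  # the commitment, modelled as one promised action

def intrinsic_choice(options):
    """Old goals: pick whatever pays best right now."""
    return max(options, key=INTRINSIC_PAYOFF.get)

def chosen_choice(options):
    """New goals: best intrinsic payoff among options that keep the promise."""
    permitted = [o for o in options if o == PROMISED_ACTION]
    return max(permitted, key=INTRINSIC_PAYOFF.get)

POLICIES = {"stay resolute": chosen_choice, "revert to CDT": intrinsic_choice}

def pick_policy(evaluate):
    """Choose the policy whose resulting action the given goals would select."""
    resulting_action = {name: policy(OPTIONS) for name, policy in POLICIES.items()}
    best_action = evaluate(list(resulting_action.values()))
    return next(n for n, a in resulting_action.items() if a == best_action)

print(pick_policy(chosen_choice))     # stay resolute
print(pick_policy(intrinsic_choice))  # revert to CDT
```

Judged by the chosen goals, staying resolute wins; judged by the intrinsic goals, reverting wins. Each set of goals endorses itself, which is why the standpoint from which the agent evaluates matters.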
We will address one last argument. The agent is either pre-committed to pay or it is not. It seems to follow that the agent ought to try to not pay: if it succeeds, then it wasn’t pre-committed, so it’s making the rational decision, whilst if it fails, it was pre-committed and is therefore no worse off. But again, the agent could have predicted in advance that it was going to face this temptation and should have been prepared for it.
Additionally, this argument reasons from the agent’s old goals and not from its new goals. According to its new, chosen goals, it wants to honor its prior commitments more than it wants to maximise its intrinsic goals. So we can obtain the opposite argument: if it fails to not pay, then it is no better off, while if it succeeds, then it has broken its commitments, which is against its goals.
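The two mirror-image arguments can be checked as weak-dominance claims. The sketch below is illustrative only: `COMMITMENT_WEIGHT` is an assumed number standing in for “wants to honor its commitments more than it wants the money”, and the function names are hypothetical.

```python
STATES = ["pre-committed", "not pre-committed"]
FARE = 100
COMMITMENT_WEIGHT = 1_000  # assumed: honouring the promise outweighs the fare

def money_kept(action, state):
    """Withholding only succeeds if the agent was not actually pre-committed."""
    withheld = action == "try not to pay" and state == "not pre-committed"
    return FARE if withheld else 0

def old_goals(action, state):
    """Intrinsic (selfish) goals: only the money matters."""
    return money_kept(action, state)

def new_goals(action, state):
    """Chosen goals: keeping the fare means a broken promise, which is penalised."""
    broke_promise = money_kept(action, state) > 0
    return money_kept(action, state) - (COMMITMENT_WEIGHT if broke_promise else 0)

def weakly_dominates(a, b, utility):
    """True if a is never worse than b and strictly better in some state."""
    diffs = [utility(a, s) - utility(b, s) for s in STATES]
    return all(d >= 0 for d in diffs) and any(d > 0 for d in diffs)

print(weakly_dominates("try not to pay", "pay", old_goals))  # True
print(weakly_dominates("pay", "try not to pay", new_goals))  # True
```

The same dominance reasoning points in opposite directions depending on which goals supply the utilities, so the original argument only has force for an agent that is still reasoning from its old goals.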
There are still a lot of unanswered questions that would have to be investigated to develop this theory, but I’m just trying to roughly sketch this perspective at this stage. At the very least, it seems like a worthwhile project.
Key unanswered questions:
What if the agent arrives in town and learns that they actually need that $100 to get out of the country, otherwise they will be tortured and killed? Perhaps a selfish agent’s resoluteness only lasts so long as the agreement actually produces a better outcome for them?
What if the driver agrees, then changes his mind, but is then forced by the authorities to go back and honor his agreement? Is a resolute agent still committed to paying him the $100, as he technically fulfilled the agreement, or can the agent defect given that he tried to defect on them?
What if the agent mishears the driver, and the driver merely wants $100 rather than demanding it as a condition of rescuing them? Would a selfish agent still be resolute about paying when they find out that there was no necessity for them to do so?
How can we justify initially setting our chosen goals to our intrinsic goals in a way that doesn’t insist that these remain consistent later on?
The Psychology Of Resolute Agents
Epistemic Status: Exploratory
Related Posts:
Newcomb’s Problem and Regret of Rationality: Argues along similar lines for Timeless Decision Theory.