From my understanding of the site, each successive decision theory builds on previous models in an upward progression: CDT/EDT is surpassed by the “winning” paradigm of TDT, which in turn is surpassed by the updateless paradigm of UDT/FDT. The goal these decision theories move toward is an agent that “understands, agrees with, and deeply believes in human morality” (from the Superintelligence FAQ): in other words, a mathematical model that correctly reproduces the impetus behind human morality.
With that in mind, I start to notice certain trends across the advancements in decision theory that reflect what kind of ideology a truly moral AGI would use. CDT, the most straightforward and naïve decision theory, is unaligned in a way that appears to be highly amoral. In Parfit’s Hitchhiker, for example, our intuition says that rewarding someone who has gone out of their way to help you is the morally right thing to do. But in CDT, once the agent obtains what it wants (getting home), it has no incentive to pay.
Any kind of precommitment, like a contract or a constitution, would be meaningless to a CDT agent, because as soon as the agent is in a position to achieve the highest utility by violating the contract, it will do so (assuming it isn’t under immediate threat of punishment). CDT essentially seeks its own immediate gain (maximizing the utility of its actions) with little regard for others.
But precommitting is the hallmark of timelessness, so TDT has the special property of being able to keep a promise: committing to fulfill its promises in the future is what achieves the best result in the present. This aligns much more closely with a human concept of morality, as a TDT agent will faithfully hold to its word even in situations that don’t give it an immediate benefit, because it is factoring in the long-term benefit of cooperation (thinking back to Parfit’s Hitchhiker).
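To make the contrast concrete, here is a minimal toy model of Parfit’s Hitchhiker in Python. The payoff numbers (a $100 payment, a very large penalty for being stranded) are my own illustrative assumptions, not canonical values; the point is only that CDT scores actions from where it already stands, while a TDT-style agent scores the whole policy that the driver predicts.

```python
# Toy model of Parfit's Hitchhiker. The payoff numbers are illustrative
# assumptions on my part, not canonical values.
DIE_IN_DESERT = -1_000_000  # stranded because the driver predicts no payment
PAY_IN_TOWN = -100          # cost of paying the driver after being rescued
HOME_SAFE = 0               # baseline utility of getting home

def cdt_action_value(pay: bool) -> int:
    """CDT evaluates the action from where it already stands (safely home),
    so paying is a pure loss with no causal upside."""
    return HOME_SAFE + (PAY_IN_TOWN if pay else 0)

def policy_value(always_pay: bool) -> int:
    """A TDT/UDT-style agent evaluates the whole policy; the driver's
    prediction of that policy decides whether the rescue happens at all."""
    return (HOME_SAFE + PAY_IN_TOWN) if always_pay else DIE_IN_DESERT

print(max([True, False], key=cdt_action_value))  # False: once home, refuses to pay
print(max([True, False], key=policy_value))      # True: commits to paying, gets rescued
```

The CDT agent, deciding after the rescue, refuses to pay; a driver who predicts this never picks it up in the first place, which is exactly why committing to the paying policy comes out ahead.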
In a way, TDT also aligns with our concept of moral codes and punishment. Even though punishing an agent doesn’t causally affect the actions it has already taken, it does deter the agent from committing those actions in the future, and knowing about the punishment a priori would deter it from considering the action in the first place (akin to Acausal Blackmail).
Adding in updatelessness takes another step closer towards human morality. I found Tyrrell’s post about UDT agents as deontologists to be particularly eye-opening, if not the best explanation for Counterfactual Mugging I’ve seen, although I don’t fully get why his post was so poorly received. As he described it:
The act-evaluating function is just a particular computation which, for the agent, constitutes the essence of rightness… Merely doing the right thing has given the agent all the utility it could hope for… although the agent lost $100, it really gained from the interaction
An updateful agent may choose different actions in different situations to maximize utility given the observations it has updated on (thus discarding counterfactuals). This becomes morally inconsistent: the updateful agent may lie or steal (or, in this case, refuse to pay Omega) as long as the consequences of those actions are relegated to a counterfactual.
The updateless agent, however, considers its policy function over the probability distribution across all possible worlds. This, to me, appears similar to the Kantian approach of evaluating which actions are universally “good” by judging whether such actions would be permissible in all possible scenarios (weighted by their probability). That is why calling the UDT agent a “deontologist”, as in Tyrrell’s post, makes the most sense to me.
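As a rough illustration of that distinction, here is a sketch of Counterfactual Mugging with the standard stakes (pay $100 on tails, receive $10,000 on heads if Omega predicts you would have paid); framing the two evaluations as expected values is my own simplification, not anything formal from UDT.

```python
# Sketch of Counterfactual Mugging with the standard $100 / $10,000 stakes;
# expressing the two evaluations as expected values is my own simplification.
P_HEADS = P_TAILS = 0.5
REWARD, COST = 10_000, -100

def updateful_value(pay: bool) -> float:
    """Having updated on seeing tails, the heads branch is a mere
    counterfactual, so refusing to pay looks strictly better."""
    return COST if pay else 0

def updateless_value(pay: bool) -> float:
    """The policy is scored across both possible worlds, weighted by
    probability, as if chosen before any observation was made."""
    heads = REWARD if pay else 0  # Omega rewards agents it predicts would pay
    tails = COST if pay else 0
    return P_HEADS * heads + P_TAILS * tails

print(updateful_value(True), updateful_value(False))    # -100 vs 0     -> refuse
print(updateless_value(True), updateless_value(False))  # 4950.0 vs 0.0 -> pay
```

Under this weighting, the paying policy is worth 4,950 in expectation even though in the world the agent actually observes it just loses $100, which is the sense in which the updateless agent “really gained from the interaction.”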
As a possible analogy, suppose we set aside Omega and instead take a more religious setting, the Parable of the Good Samaritan (Luke 10:30-37). For centuries, Christian theology has lauded the selfless act of the Samaritan, who helps this stranger at great cost to himself and expects no benefit in return. But in reality, the Samaritan here is considering the probability of the counterfactual situation where their roles are reversed. What if the Samaritan had been attacked by muggers instead, and it was the Jew who found his body? As long as the Samaritan doesn’t plan to run into muggers in the future, he can consider this scenario a counterfactual. However, in this passage the probability of this counterfactual is substantial, possibly even the same 50:50 split as Omega’s coin (remember, the Jew and the Samaritan both left the same city, and the muggers chose to attack one of them at random). If the Samaritan doesn’t help the Jew now, then the Jew won’t help him in the counterfactual situation where the roles are reversed.
The reason I use this analogy is to illustrate how updatelessness, by the nature of modeling human moral systems, incidentally becomes the crux of religious ethics, as Jesus said: “in everything, do to others what you would have them do to you, for this sums up the Law and the Prophets” (Matthew 7:12). It is in this same sermon that Jesus also stated his opinion on timelessness:
You have heard that it was said to those of old, ‘You shall not swear falsely, but shall perform your oaths to the Lord.’ But I say to you, do not swear at all… But let your ‘Yes’ be ‘Yes,’ and your ‘No,’ ‘No.’ For whatever is more than these is from the evil one. (Matthew 5:33-37)
So in summary, it seems like the progression of decision theory from CDT/EDT → TDT → UDT/FDT moves towards a kind of moral absolutism found in religious or Kantian ethics. And to some degree, it makes intuitive sense why we would want to make an AI a moral absolutist. The purpose of FAI is not only to make an agent that does good, but one that does so reliably and predictably. If an AI is making moral choices based on intuition or subjectivity, then that automatically makes it less predictable. In fact, we want to hold an AI to an even greater moral standard than we hold each other to, largely because of the greater responsibility AI is anticipated to have.
If God made man in his image, and we make AI in our image, then by transitivity we desire an AI that lives up to the holiness we associate with God.
What do you think? Are these connections more of a coincidence? Do you think Jesus or other religious figures could have been early pioneers of decision theory, ahead of their time?
A TDT agent does not “make promises.” It behaves as if it made promises, when such behavior helps it—and when promises are not necessary and deception suffices, it will do that too.
Isn’t a deceptive agent the hallmark of unfriendly AI? In what scenarios does a dishonest agent reflect a good design?
Of course, I didn’t mean to say that TDT always keeps its promises, just that it is capable of doing so in scenarios like Parfit’s Hitchhiker, where CDT is not.
Decision theories are explicitly not moral. It is easy to construct scenarios where agents acting according to any of them will lie, cheat, steal, and murder.
You can probably get something like a moral theory from some bunch of assumptions about preferences + various types of shared interests + universally applying some decision theory. But it still won’t be a moral theory, and I doubt you can get anything like most human (including religious) morality out of it.