Why isn’t the following decision theory optimal?
I’ve recently read the decision theory FAQ, as well as Eliezer’s TDT paper. When reading the TDT paper, a simple decision procedure occurred to me which, as far as I can tell, gets the correct answer to every tricky decision problem I’ve seen. As discussed in the FAQ above, evidential decision theory gets the chewing gum problem wrong, causal decision theory gets Newcomb’s problem wrong, and TDT gets counterfactual mugging wrong.
In the TDT paper, Eliezer postulates an agent named Gloria (page 29), who is defined as an agent who maximizes on decision-determined problems. He describes how a CDT-agent named Reena would want to transform herself into Gloria. Eliezer writes:
By Gloria’s nature, she always already has the decision-type causal agents wish they had, without need of precommitment.
Eliezer then goes on to develop TDT, which is supposed to construct Gloria as a byproduct:
Gloria, as we have defined her, is defined only over completely decision-determined problems of which she has full knowledge. However, the agenda of this manuscript is to introduce a formal, general decision theory which reduces to Gloria as a special case.
Why can’t we instead construct Gloria directly, using the idea of the thing that CDT agents wished they were? Obviously we can’t just postulate a decision algorithm that we don’t know how to execute, and then note that a CDT agent would wish they had that decision algorithm, and pretend we had solved the problem. We need to be able to describe the ideal decision algorithm to a level of detail that we could theoretically program into an AI.
Consider this decision algorithm, which I’ll temporarily call Nameless Decision Theory (NDT) until I get feedback about whether it deserves a name: you should always make the decision that a CDT-agent would have wished he had pre-committed to, if he had previously known he’d be in his current situation and had the opportunity to precommit to a decision.
In effect, you are making a general precommitment to behave as if you had made all specific precommitments that would ever be advantageous to you.
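To make the rule above concrete, here is a minimal sketch of how it might be evaluated on Newcomb’s problem. The payoffs, the predictor accuracy, and the names `expected_payoff` and `ndt_choice` are all assumptions made up for illustration; none of them come from the post or the TDT paper. The idea is only that each candidate precommitment is scored from the vantage point before the predictor acts.

```python
# Hypothetical stakes and predictor accuracy (assumptions, not from the post).
BIG, SMALL = 1_000_000, 1_000
ACCURACY = 0.99


def expected_payoff(policy: str, accuracy: float = ACCURACY) -> float:
    """Expected payoff of precommitting to `policy` before the prediction is made."""
    if policy == "one-box":
        # The opaque box is full whenever the predictor is right about us.
        return accuracy * BIG
    if policy == "two-box":
        # The opaque box is full only when the predictor is wrong about us.
        return SMALL + (1 - accuracy) * BIG
    raise ValueError(f"unknown policy: {policy}")


def ndt_choice(policies=("one-box", "two-box")) -> str:
    """Pick the precommitment with the highest expected payoff."""
    return max(policies, key=expected_payoff)


print(ndt_choice())
```

Under these assumed numbers the one-box precommitment scores far higher, which is the answer the rule is meant to produce.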
NDT is so simple, and Eliezer comes so close to stating it in his discussion of Gloria, that I assume there is some flaw with it that I’m not seeing. Perhaps NDT does not count as a “real” or “well-defined” decision procedure, or can’t be formalized for some reason? Even so, it does seem like it’d be possible to program an AI to behave in this way.
Can someone give an example of a decision problem for which this decision procedure fails? Or for which there are multiple possible precommitments that you would have wished you’d made and it’s not clear which one is best?
EDIT: I now think this definition of NDT better captures what I was trying to express: You should always make the decision that a CDT-agent would have wished he had precommitted to, if he had previously considered the possibility of his current situation and had the opportunity to costlessly precommit to a decision.
Actually, if you push the precommitment time all the way back, this sounds a lot like an informal version of Updateless Decision Theory, which, by the way, seems to get right everything that TDT gets right, plus counterfactual mugging and a lot of experiments that TDT gets wrong.
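A rough sketch of why the timing of the precommitment matters in counterfactual mugging. The stakes, the fair coin, the perfectly accurate predictor, and the function names are assumptions chosen for illustration, not part of any formal statement of UDT.

```python
# Hypothetical stakes (assumptions): pay COST on tails, receive REWARD on heads.
REWARD, COST = 10_000, 100


def value_before_flip(pays_when_asked: bool) -> float:
    """Expected value of a policy, judged before the fair coin is flipped."""
    # On heads, the predictor rewards only agents that would have paid on tails.
    heads = REWARD if pays_when_asked else 0
    # On tails, a paying agent hands over COST.
    tails = -COST if pays_when_asked else 0
    return 0.5 * heads + 0.5 * tails


def value_after_seeing_tails(pays_when_asked: bool) -> float:
    """Value of the same policy, judged only after tails has been observed."""
    return -COST if pays_when_asked else 0


print(value_before_flip(True), value_before_flip(False))
print(value_after_seeing_tails(True), value_after_seeing_tails(False))
```

Judged before the flip, the paying policy comes out ahead; judged after seeing tails, refusing looks better. Pushing the precommitment time back amounts to using the first evaluation.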
Are you implying that UDT is formal?
Fair enough. A less formal version of UDT. UDT at least has a formulation in Gödel-Löb provability logic.
There’s one scenario described in this paper on which this decision theory gives in to blackmail:
I believe that NDT gets this problem right.
The paper you link to shows that a pure CDT agent would not self modify into an NDT agent, because a CDT agent wouldn’t really have the concept of “logical” connections between agents. The understanding that both logical and causal connections are real things is what would compel an agent to self-modify to NDT.
However, if there was some path by which an agent started out as pure CDT and then became NDT, the NDT agent would still choose correctly on Retro Blackmail even if the researcher had its original CDT source code. The NDT agent’s decision procedure explicitly tells it to behave as if it had precommitted before the researcher got its source code.
So even if the CDT --> NDT transition is impossible, since I don’t think any of us here are pure CDT agents, we can still adopt NDT and profit.
In the retro blackmail problem, CDT does not precommit to refusing even if it’s given the opportunity to do so before the researcher gets its source code. This is because CDT believes that the researcher is predicting according to a causally disconnected copy of itself, and therefore it does not believe that its actions can affect the copy. (That is, if CDT knows it is going to be retro blackmailed, and considers this before the researcher gets access to its source code, then it still doesn’t precommit.) The failure here is that CDT only reasons according to what it can causally affect, but in the real world decision algorithms also need to worry about what they can logically affect. (For example, two agents created while spacelike separated should be able to cooperate on a Prisoner’s Dilemma.)
Your attempted patch (pretend you made your precommitments earlier in time) only works when the neglected logical relationships stem from a causal event earlier in time. This is often but not always the case. For instance, if CDT thinks that its clone was causally copied from its own source code, then you can get the right answer by acting as CDT would have precommitted to act before the copying occurred. But two agents written in spacelike separation from each other might have decision algorithms that are logically correlated, despite there being no causal connection no matter how far back you go.
In order to get the right precommitments in those sorts of scenarios, you need to formalize some sort of notion of “things the decision algorithm’s choice logically affects,” and formalizing “logical effects” is basically the part of the problem that remains difficult :-)
To clarify: you mean that CDT doesn’t precommit at time t=1 even if the researcher hasn’t gotten the code representing CDT’s state at time t=0 yet. The CDT doesn’t think precommitting will help because it knows the code the researcher will get will be from before its precommitment. I agree that this is true, and a CDT won’t want to precommit.
I guess my definition, even after my clarification, is ambiguous, as it’s not clear that what a CDT wishes it could have precommitted to at an earlier time should take precedence over what it would wish to precommit to at a later time. NDT seems to be best when you always prefer the earliest precommitment. The intuition is something like:
You should always make the decision that a CDT-agent would have wished he had precommitted to, if he had magically had the opportunity to costlessly precommit to a decision at a time before the beginning of the universe.
This would allow you to act as if you had precommitted to things before you existed.
Can you give an example of this? Similar to the calculator example in the TDT paper, I’m imagining some scenario where one AI takes instructions for creating you to another galaxy, and another AI keeps a copy of the instructions for creating you on Earth. At some point, both AIs read the instructions and create identical beings, one of which is you. The AI that created you says that you’ll be playing a prisoner’s dilemma game with the other entity created in the same way, and asks for your decision.
In some sense, there is only a logical connection between these two entities because they’ve only existed for a short time and are too far away to have a causal effect on each other. However they are very causally related, and I could probably make an argument that they are replicas of the same person.
Do you have an example of a logical connection that has no causal connection at all (or as minimal a causal connection as possible)?
The universe begins, and then almost immediately, two different alien species make AIs while spacelike separated. The AIs start optimizing their light cones and meet in the middle, and must play a Prisoner’s Dilemma.
There is absolutely no causal relationship between them before the PD, so it doesn’t matter what precommitments they would have made at the beginning of time :-)
To be clear, this sort of thought experiment is meant to demonstrate why your NDT is not optimal; it’s not meant to be a feasible example. The reason we’re trying to formalize “logical effect” is not specifically so that our AIs can cooperate with independently developed alien AIs or something (although that would be a fine perk). Rather, this extreme example is intended to demonstrate why idealized counterfactual reasoning needs to take logical effects into account. Other thought experiments can be used to show that reasoning about logical effects matters in more realistic scenarios, but first it’s important to realize that they matter at all :-)
I think defect is the right answer in your AI problem and therefore that NDT gets it right, but I’m aware lots of LWers think otherwise. I haven’t researched this enough to want to argue it, but is there a discussion you’d recommend I read that spells out the reasoning? Otherwise I’ll just look through LW posts on prisoner’s dilemmas.
Secondly, I’d like to try to somehow incorporate logical effects into NDT. I agree they’re important. Any suggestions for where I could find lots of examples of decision problems where logical effects matter, to help me think about the general case?
That’s surprising to me. Imagine that the situation is “prisoner’s dilemma with shared source code”, and that the AIs inspect each other’s source code and verify that (by some logical but non-causal miracle) they have exactly identical source code. Do you still think they do better to defect? I wouldn’t want to build an agent that defects in that situation :-p
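A minimal sketch of the shared-source-code case, assuming each agent is simply handed both source texts as strings. The payoff numbers and the names `mirror_bot`, `defect_bot`, and `play` are hypothetical and made up for illustration; nothing here is taken from the linked paper.

```python
# Hypothetical Prisoner's Dilemma payoffs: (row player, column player).
PAYOFF = {
    ("C", "C"): (2, 2),
    ("C", "D"): (0, 3),
    ("D", "C"): (3, 0),
    ("D", "D"): (1, 1),
}


def mirror_bot(own_source: str, other_source: str) -> str:
    """Cooperate only with an exact copy of itself."""
    return "C" if own_source == other_source else "D"


def defect_bot(own_source: str, other_source: str) -> str:
    """Always defect, whatever the opponent looks like."""
    return "D"


def play(agent_a, source_a: str, agent_b, source_b: str):
    """Let each agent inspect both sources, then score the resulting moves."""
    move_a = agent_a(source_a, source_b)
    move_b = agent_b(source_b, source_a)
    return PAYOFF[(move_a, move_b)]


# The source strings are stand-ins; only their equality matters here.
print(play(mirror_bot, "mirror", mirror_bot, "mirror"))
print(play(defect_bot, "defect", defect_bot, "defect"))
```

Two identical copies of `mirror_bot` end up at mutual cooperation, while two copies of `defect_bot` end up at mutual defection, even though in neither case does one agent’s move cause the other’s.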
The paper that jessicat linked in the parent post is a decent introduction to the notion of logical counterfactuals. See also the “Idealized Decision Theory” section of this annotated bibliography, and perhaps also this short sequence I wrote a while back.
An AI should certainly cooperate if it discovered that by chance its opposing AI had identical source code.
I read your paper and the two posts in your short sequence. Thanks for the links. I still think it’s very unlikely that one of the AIs in your original hypothetical (when they don’t examine each other’s source code) would do better by defecting.
I accept that if an opposing AI had a model of you that was just decent but not great, then there is some amount of logical connection there. What I haven’t seen is any argument about the shape of the graph of logical connection strength vs similarity of entities. I hypothesize that for any two humans who exist today, if you put them in a one shot PD, the logical connection is negligible.
Has anyone written specifically on how exactly to give weights to logical connections between similar but non-identical entities?
Nope! That’s the open part of the problem :-) We don’t know how to build a decision network with logical nodes, and we don’t know how to propagate a “logical update” between nodes. (That is, we don’t have a good formalism of how changing one algorithm logically affects a related but non-identical algorithm.)
If we had the latter thing, we wouldn’t even need the “logical decision network”, because we could just ask “if I change the agent, how does that logically affect the universe?” (as both are algorithms); this idea is the basis of proof-based UDT (which tries to answer the problem by searching for proofs under the assumption “Agent()=a” for various actions). Proof-based UDT has lots of problems of its own, though, and thinking about logical updates in logical graphs is a fine angle of approach.
Thanks. I had one question about your Toward Idealized Decision Theory paper.
I can’t say I fully understand UDT, but the ‘updateless’ part does seem very similar to the “act as if you had precommitted to any action that you’d have wanted to precommit to” core idea of NDT. It’s not clear to me that the super powerful UDT would make the wrong decision in the game where two players pick numbers between 0 and 10 and get payouts based on their pick and the total sum.
Wouldn’t the UDT reason as follows? “If my algorithm were such that I wouldn’t just pick 1 when the human player forced me into it by picking 9 (for instance maybe I always pick 5 in this game), then I may still have a reputation as a powerful predictor but it’s much more likely that I’d also have a reputation as an entity that can’t be bullied like this, so the human would be less likely to pick 9. That state of the world is better for me, so I shouldn’t be the type of agent that makes the greedy choice to pick 1 when I predict the human will pick 9.”
The argument in your paper seems to rely on the human assuming the UDT will reason like a CDT once it knows the human will pick 9.
Yep, that’s a common intuition pump people use in order to understand the “updateless” part of UDT.
A proof-based UDT agent would—this follows from the definition of proof-based UDT. Intuitively, we surely want a decision theory that reasons as you said, but the question is, can you write down a decision algorithm that actually reasons like that?
Most people agree with you on the philosophy of how an idealized decision theory should act, but the hard part is formalizing a decision theory that actually does the right things. The difficult part isn’t in the philosophy, the difficult part is turning the philosophy into math :-)
Let’s say you precommit to never paying off blackmailers. The advantage of this is that you are no longer an attractive target for blackmailers, since they will never get paid off. However, if someone blackmails you anyway, your precommitment now puts you at a disadvantage, so under NDT you would now act as if you had precommitted to comply with the blackmailers all along, since at this point that would have been an advantageous precommitment to have made.
The harder part is precisely defining what constitutes blackmail.
It seems that this is more of a bluff than a true precommitment.
I think my definition of NDT above was worded badly. The problematic part is “if he had previously known he’d be in his current situation.” Consider this definition:
You should always make the decision that a CDT-agent would have wished he had precommitted to, if he previously considered the possibility of his current situation and had the opportunity to costlessly precommit to a decision.
The key is that the NDT agent isn’t behaving as if he knew for sure that he’d end up blackmailed when he made his precommitment (since his precommitment affects the probability of his being blackmailed), but rather he’s acting “as if” he precommitted to some behavior based on reasonable estimates of the likelihood of his being blackmailed in various cases.
But if you pay the blackmailer, then you didn’t precommit to not paying him, in which case you’ll wish you had, since then you probably wouldn’t have been blackmailed. You’ll act as if you precommitted if and only if you did not.
Perhaps you’d end up precommitting to some probability of paying the blackmailer?
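One way to explore the idea of precommitting to a probability of paying is a toy model like the following. Everything in it is an assumption for illustration: the blackmailer is modeled as knowing the agent’s precommitted payment probability and threatening only when the expected take exceeds the cost of making the threat, and the parameter names (`demand`, `cost`, `harm`) are hypothetical.

```python
def blackmailer_attacks(p_pay: float, demand: float, cost: float) -> bool:
    """Assumed blackmailer: threatens only if the expected take beats the cost."""
    return p_pay * demand > cost


def expected_loss(p_pay: float, demand: float, cost: float, harm: float) -> float:
    """Agent's expected loss given a precommitted probability of paying."""
    if not blackmailer_attacks(p_pay, demand, cost):
        return 0.0
    # If threatened: pay `demand` with probability p_pay, otherwise suffer `harm`.
    return p_pay * demand + (1 - p_pay) * harm


def best_precommitment(demand: float, cost: float, harm: float, steps: int = 100) -> float:
    """Grid-search the payment probability that minimizes expected loss."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda p: expected_loss(p, demand, cost, harm))


print(best_precommitment(demand=100, cost=10, harm=500))
print(expected_loss(1.0, demand=100, cost=10, harm=500))
```

In this toy model, any payment probability low enough to make the threat unprofitable avoids blackmail entirely, so the search settles on never paying. A blackmailer who sometimes threatens regardless would change the picture, and the model does nothing to resolve the self-reference worry raised above.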
Isn’t “NDT” just CDT + reflective consistency? There are a number of reasons why that might fail. (Can’t think of any right now, though—I’ll get back to you.)
Pretty much. The fact that NDT is so obvious is why I’m puzzled as to why TDT needed to be created, and why Eliezer didn’t end his paper shortly after the discussion of Gloria. NDT seems to get all the tricky decision problems right, even at least one that TDT gets wrong, so what am I missing?
There is no perfect decision procedure which is beneficial in all possible situations. If the situation is able to know the agent’s decision procedure, it can act in such a way as to “minimize the agent’s utility if the agent uses decision procedure X”, and in that situation decision procedure X, whatever it is, will be bad for the agent. So in order to have perfect knowledge of what decision procedure to use, you have to know what situations are going to actually happen to you.
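The argument can be sketched as a toy environment that reads off which procedure the agent runs and penalizes exactly that one. The procedure labels and the payoff values are placeholders invented for illustration.

```python
def adversarial_payoff(agent_procedure: str, targeted_procedure: str) -> int:
    """A situation built to punish one specific decision procedure."""
    return -1 if agent_procedure == targeted_procedure else 1


PROCEDURES = ["CDT", "EDT", "TDT", "NDT"]

# For every procedure X there is some situation in which X does worst.
for target in PROCEDURES:
    results = {p: adversarial_payoff(p, target) for p in PROCEDURES}
    worst = min(results, key=results.get)
    print(f"situation targeting {target}: worst performer is {worst}")
```

Whatever label the agent carries, the situation targeting that label makes it the worst performer, which is all the argument needs.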
Eliezer talked about this in his TDT paper. It is possible to hypothesize scenarios where agents get punished or rewarded for arbitrary reasons. For instance an AI could punish agents who made decisions based on the idea of their choices determining the results of abstract computations (as in TDT). This wouldn’t show that TDT is a bad decision theory or even that it’s no better than any other theory.
If we restrict ourselves to action-determined and decision-determined problems (see Eliezer’s TDT paper) we can say that TDT is better than CDT, because it gets everything right that CDT gets right, plus it gets right some things that CDT gets wrong.
Can you think of any way that a situation could be set up that punishes an NDT agent, that doesn’t reduce to an AI just not liking NDT agents and arbitrarily trying to hurt them?
This sounds a lot like the objections CDT people were giving to Newcomb’s problem.
Does it? I’m not so sure.
Anyhow, the short answer is that the reason people have done a bunch of extra work is because we don’t just want an English-language explanation of what happens, we want to describe a specific computation. Not that the verbal descriptions aren’t really useful, but precision has its merits; it often takes stating a specific algorithm to realize that your algorithm does something you don’t want, and you actually have to go back and revise your verbal description.
For example, a decision algorithm based on precommitment is unable to hold selfish preferences (valuing a cookie for me more than a cookie for a copy of me) in anthropic situations (apologies for how messy that series of posts is). But since I’m of the opinion that it’s okay to have selfish preferences, this means that I need to use a more general model of what an ideal decision theory looks like.
I disagree that it makes sense to talk about one of the future copies of you being “you” whereas the other isn’t. They’re both you to the same degree (if they’re exact copies).
I agree with you there—what I mean by selfish preferences is that after the copies are made, each copy will value a cookie for itself more than a cookie for the other copy—it’s possible that they wouldn’t buy their copy a cookie for $1, but would buy themselves a cookie for $1. This is the indexically-selfish case of the sort of preferences people have that cause them to buy themselves a $1 cookie rather than giving that $1 to GiveDirectly (which is what they’d do if they made their precommitments behind a Rawlsian veil of ignorance).
Confused. What’s incoherent about caring equally about copies of myself, and less about everyone else?
I don’t think I said it was incoherent. Where are you getting that from?
To expand on a point that may be confusing: indexically-selfish preferences (valuing yourself over copies of you) will get precommitted away if you are given the chance to precommit before being copied. Ordinary selfish preferences would also get precommitted away, but only if you had the chance to precommit sometime like before you came into existence (this is where Rawls comes in).
So if you have a decision theory that says “do what you would have precommitted to do,” well, you end up with different results depending on when people get to precommit. If we start from a completely ignorant agent and then add information, precommitting at each step, you end up with a Rawlsian altruist. If we just start from yesterday, then if you got copied two days ago you can be indexically selfish, but if you got copied this morning you can’t.
The problem is that Rawls gets the math wrong even in the case he analyzes.