Weird Random Newcomb Problem
Epistemic status: I’m pretty sure the problem is somewhat interesting, because it temporarily confused several smart people. I’m not at all sure that it is very original; probably somebody has already thought about something similar. I’m not at all sure that I have actually found a flaw in UDT, but I somewhat expect that a discussion of this problem may clarify UDT for some people.
This post emerged from my work in the “Deconfusing Commitment Races” project under the Supervised Program for Alignment Research (SPAR), led by James Faville. I’m grateful to SPAR for providing the intellectual environment and to James Faville personally for intellectual discussions and help with the draft of this post. Any mistakes are my own.
I used Claude and Gemini to help me with phrasing and grammar in some parts of this post.
Scenario
Let P be the set of all programs in a fixed programming language that satisfy the following conditions:
They take a pair of integers as input.
They output either the string “take 1 box” or “take 2 boxes”.
They always halt within M steps, where M is a very large constant.
They contain no unreachable code.
P is a finite, albeit extremely large, set.
Let n be some fixed bijective numbering of the programs in P.
Omega presents you with a variation of the Newcomb Problem.
Omega randomly selects a program a uniformly from P.
Omega runs a with the input (n(a), n(a)).
If a outputs “take 1 box”, Omega puts $1,000,000 in the first box.
If a outputs “take 2 boxes”, Omega leaves the first box empty.
As in the standard Newcomb Problem, Omega always puts $1,000 in the second box.
However, the program that decides whether you get the money from one or from both boxes is not (necessarily) a. Let’s call this program b. Program b is also an element of P. It receives the pair (n(b), n(a)) as input – its own number according to the numbering n, and the number of the program a that Omega randomly selected. Based on this input, b must output either “take 1 box” or “take 2 boxes”.
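For concreteness, here is a minimal Python sketch of one round of the game as described above. The representation of programs as Python callables, the dict n, and the function name play_round are my own illustrative choices, not part of the problem statement.

```python
import random

def play_round(P, n, b):
    # P: list of programs, each a callable that takes a pair of integers
    #    and returns "take 1 box" or "take 2 boxes"
    # n: dict assigning each program its number under the bijective numbering
    # b: the box-choosing program (an element of P)
    a = random.choice(P)                    # Omega samples a uniformly from P
    box1 = 1_000_000 if a((n[a], n[a])) == "take 1 box" else 0
    box2 = 1_000                            # always in the second box
    choice = b((n[b], n[a]))                # b sees its own number and a's number
    return box1 if choice == "take 1 box" else box1 + box2
```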
Questions
Question 1: Assume you are program b. You want to maximize the money you receive. What should you output if your input is (x, x) (i.e., the two numbers are equal)?
Question 2: Assume you are the programmer writing program b. You want to maximize the expected money program b receives. How should you design b to behave when it receives an input (x, x)?
(Feel free to pause and consider these questions before reading further.)
-
-
-
-
Question 1 appears analogous to the standard Newcomb Problem. Omega ran your own code (acting as a) on the same input you received to determine whether to place the $1,000,000 in the first box. So it seems you should take one box.
But in Question 2 it’s better to write a program which always takes 2 boxes! Consider a programmer choosing between implementing b as one of two specific programs:
p1: Outputs “take 1 box” if the numbers in its input are equal, and “take 2 boxes” otherwise.
p2: Always outputs “take 2 boxes”, regardless of input.
p2 gets an additional $1,000 that p1 doesn’t get when a = p1, and gets the same payoff as p1 in all other cases. The probability for a to be any specific program is independent of b. So p2 is strictly better than p1.
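A toy computation makes this concrete. This is only a sketch: the three-element stand-in for P, the arbitrary numbering, and the filler program are my own inventions; the real P is astronomically larger.

```python
def p1(pair):
    x, y = pair
    return "take 1 box" if x == y else "take 2 boxes"

def p2(pair):
    return "take 2 boxes"

def always_one(pair):        # filler program, just so |P| > 2
    return "take 1 box"

def payoff(b, a, n):
    box1 = 1_000_000 if a((n[a], n[a])) == "take 1 box" else 0
    choice = b((n[b], n[a]))
    return box1 if choice == "take 1 box" else box1 + 1_000

P = [p1, p2, always_one]                     # tiny stand-in for the real P
n = {prog: i for i, prog in enumerate(P)}    # arbitrary bijective numbering

for b in (p1, p2):
    expected = sum(payoff(b, a, n) for a in P) / len(P)
    print(b.__name__, expected)
# p1 and p2 receive identical payoffs for every a except a = p1,
# where p2 pockets the extra $1,000, so E[p2] - E[p1] = 1000/|P|.
```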
So, if you are the program, you prefer to choose one action. But if you are the programmer who writes this program, you prefer it to choose another action in the same circumstances.
Isn’t it normal?
At first glance there are many problems like this. Justifications of advanced decision theories often use problems with this property. Usually their discussion ends with something like ”...and that’s why you should follow the optimal policy even if you didn’t explicitly precommit to it beforehand”. The argument usually takes one of the following framings:
“Universal precommitment” framing: You prefer to have an optimal policy. Sometimes the optimal policy includes locally non-optimal decisions (e.g., if someone predicts your policy). So you would like to make a precommitment for such cases. You can’t think about all possible situations in advance, so it’s better to make a precommitment “I will follow any precommitment which would be a good idea to make in advance”. It would be a good idea to precommit to take one box in the normal Newcomb Problem, to pay the driver in Parfit’s Hitchhiker, to pay Omega in Counterfactual Mugging. So you do it.
“Functional” framing: You control the input-output behavior of your decision-making function. This function can be instantiated not only in you, but in some other places, e.g., in someone who predicts your behavior. You necessarily control all instances at the same time. You prefer the instances in predictions to take one box/to pay the driver/to pay Omega. So you do it.
“Anthropic” framing: You actually don’t know if you are the “real you” or the simulation of yourself in the good prediction (otherwise it wouldn’t be good). So normal causal expected utility calculations tell you it’s better to take one box (it’s -$1K if you are “real you”, but +$1M if you are a simulation, and chances are 50-50)/to pay the driver (-$100 and +life, chances of being a simulation are at least 50%)/to pay Omega (+$500 and -$100, chances of being a simulation are 2 to 1).
No, it’s weird (I think)
As we see, these framings usually point in the same direction. But not here! When the program receives two equal numbers as input, the advice from these approaches is:
“Universal precommitment”: The optimal policy is to always take two boxes. Do it!
“Functional”: You control yourself and Omega’s program (because both are the same program), and it’s better for you if you both take one box. So leave $1K on the table to get a million!
“Anthropic”: You almost certainly are Omega’s program and not the player’s program. Take one simulated box to put $1M in the real one!
So here the “universal precommitment” approach conflicts with the other two. And for me personally, the advice to take both boxes here (if you are the program; if you are the programmer, you definitely should write a program that always takes two boxes, no problem with that) is much more counterintuitive than the “universal precommitment” advice in any other decision theory problem I know. I think the “universal precommitment” framing is the closest to what UDT actually means, so my confidence in it has now been somewhat shaken.
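To put numbers on the clash (a sketch reusing the payoff rule from the scenario): the two camps are evaluating different counterfactuals.

```python
MILLION, THOUSAND = 1_000_000, 1_000

# "Functional"/"anthropic" counterfactual: on equal inputs a and b are the same
# program, so b's answer also fixes what a answered when Omega filled box 1.
one_box_given_a_equals_b = MILLION           # a one-boxed too: box 1 is full
two_box_given_a_equals_b = 0 + THOUSAND      # a two-boxed too: box 1 is empty
print(one_box_given_a_equals_b - two_box_given_a_equals_b)  # 999000 for one-boxing

# "Universal precommitment" counterfactual: compare two different fixed
# programs (p1 and p2) against the same random a; they only differ when a = p1,
# where the always-two-boxer takes the extra $1,000 on top of a full box 1.
print((MILLION + THOUSAND) - MILLION)                        # 1000 for two-boxing
```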
I think it’s weird. Do you?
(Just skimmed, also congrats on the work)
Why is this surprising? You’re basically assuming that there is no correlation between what program Omega predicts and what program you actually are. That is, Omega is no predictor at all! Thus, obviously you two-box, because one-boxing would have no effect on what Omega predicts. (Or maybe the right way to think about this is: it will have a tiny but non-zero effect, because you are one of the |P| programs, but since |P| is huge, that is ~0.)
When instead you condition on a = b, this becomes a different problem: Omega is now a perfect predictor! So you obviously one-box.
Another way to frame this is whether you optimize through the inner program or assume it fixed. From your prior, Omega samples randomly, so you obviously don’t want to optimize through it. Not only because |P| is huge, but actually more importantly, because for any policy you might want to implement, there exists a policy that implements the exact opposite (and might be sampled just as likely as you), thus your effect nets out to exactly 0. But once you update on (condition on) seeing that actually Omega sampled exactly you (instead of your random twin, or any other old program), then you’d want to optimize through a! But, you can’t have your cake and eat it too. You cannot shape yourself to reap that reward (in the world where Omega samples exactly you), without also shaping yourself to give up some reward (relative to other players) in the world where Omega samples your evil twin.
(Thus you might want to say: Aha! I will make my program behave differently depending on whether it is facing itself or an evil twin! But then… I can also create an evil twin relative to that more complex program. And we keep on like this forever. That is, you cannot actually, fully, ever condition your evil twin away, because you are embedded in Omega’s distribution. I think this is structurally equivalent to some commitment race dynamics I discussed with James, but I won’t get into that here.)
Secretly, I think this duality only feels counter-intuitive because it’s an instance of dynamic inconsistency, = you want to take different actions once you update on information, = the globally (from the uncorrelated prior) optimal action is not always the same as the locally optimal action (from a particular posterior, like the one assuming correlation). Relatedly, I think the only reason your Universal framing differs from your Functional and Anthropic framings is exactly that they are (implicitly) using these two different distributions (one without correlation, the other with):
The Universal framing assumes that Omega samples randomly (no correlation).
The Functional framing assumes that “you have control over the inner program”. But this is sneaking something in. You obviously have control over your own program. But you don’t actually have any control over “the program that Omega will randomly sample from P” (because anything you do is cancelled by your evil twin). Thus, assuming you have control over the inner program, is equivalent to assuming that Omega sampled you. That is, yes correlation.
The Anthropic framing is also implicitly assuming “Omega samples you”, although it’s a bit more complicated since it also depends on your utility function (how much you care about different programs getting different amounts of money):
If your distribution is truly “Omega samples programs at random”, then when you observe two equal numbers, you are almost certain to be in a simulation. Given that, if you for example care equally about all programs getting as much money as possible, then of course you should one-box. That will entail that each random program (with a tiny probability) gets a million, which is a huge win. But the intuition that you were expressing in Question 2 (“p2 is better than p1 because it scores better”) isn’t compatible with “caring equally about all programs”. Instead, it sounds as if you positively want to score better than other programs, that is, maximize your score and minimize theirs! If that’s the case, then you obviously should two-box, since almost certainly you are subtracting a million from another program, not yourself. Even assuming, for simplicity, that you only care about beating one other program (as in the p1 and p2 example), you should two-box, because you are subtracting the same million dollars from both, but you are gaining a very slight edge with the thousand dollars.
If your distribution is instead “a = b” (assumes correlation), then, regardless of whether you want to maximize everyone’s payoff or you want to beat another program that could be sampled, you want to one-box, since the million dollar benefit is coming straight to you, and is bigger than the thousand dollar benefit.
No effect. I meant that the programmer has to write a b from P, not that b is added to P. Probably I should change the phrasing to make it clearer.
No, the utility here is just the amount of money b gets, whatever program it is. a doesn’t get any money; it just determines what will be in the first box.
I meant that it sounded like you “wanted a better average score (over a’s) when you are randomly sampled as b than other programs”. Although again I think the intuition-pumping is misleading here because the programmer is choosing which b to fix, but not which a to fix. So whether you wanna one-box only depends on whether you condition on a = b.
Somewhat mean caveat emptor for other readers: I just spent an hour trying to understand this post, and wish that I hadn’t. It’s still possible I’m missing the thing, but inside view is I’ve found the thing and the thing just isn’t that interesting.[1]
Feeding a program its Gödel numbering isn’t relevant (and doesn’t work?!), and the puzzle is over perhaps missing out on an unfathomably small amount of money[2].
By “unfathomably small” I mean ≈ 1/(2⋅10^10^googol) dollars. And, sure, there could be a deep puzzle there, but I feel that when a puzzle has its accidental complexity removed you usually can produce a more compelling use case.
As a function of M, |P| is very likely to be exponential, and so it will take O(M) symbols to specify a member of P. Under many encodings, no program in P can even check whether its inputs are equal before running out of time.
That aside, why are you assuming that program b “wants” anything? Essentially all of P won’t be programs that have any sort of “want”. If it is a precondition of the problem that b is such a program, what selection procedure is assumed between those that do “want” money from this scenario? Note that being selected for running is also a precondition for getting any money at all, so this selection procedure is critically important—far more so than anything the program might output!
Oops, I didn’t think about that, thanks! Maybe it would be better to change it so that the input is “a=b” or “a!=b”, and a always gets “a=b”.
The programmer who wrote b decided that it should be a consequentialist agent that wants to get money. (Or, if this program is actually a, it wants to maximize the payment for b just because such a program was chosen by Omega by pure luck.)
I’ll try to make it clearer:
Suppose b “knows” that Omega runs this experiment for all programs b. Then the optimal behaviour for a competent b (by a ridiculously small margin) is to 1-box.
Suppose b suspects that box-choosing programs are slightly less likely to be run if they 1-box on equal inputs. Then the optimal behaviour for b is to 2-box, because the average extra payoff for 1-boxing on equal inputs is utterly insignificant while the average penalty for not being chosen to run is very much greater. Anything that shifts the probability of being run as box-chooser by more than 1000/|P| (which is on the order of 1/10^10^10^10^100) matters far more than what the program actually does.
In the original Newcomb problem, you know that you are going to get money based on your decision. In this problem, a running program does not know this. It doesn’t know whether it’s a or b or both, and every method for selecting a box-chooser is a different problem with different optimal strategies.
I’m not sure question 1 is analogous to Newcomb’s problem. We’re trying to maximize the money made by the program overall, not just the money made in the specific (x,x) case. In other words, even if you’re “inside” the problem, the UDT-ish thing to do is first figure out the optimal way to respond to all (x,y) pairs, and only then feed it the (x,x) pair. Wei Dai called this “UDT1.1”, first optimizing over all input-output maps and then applying it to a specific input.
Yes, that’s basically the same as what I mean by the “universal precommitment” framing. The weirdness is that usually (I think, in all other decision-theoretic problems I’ve ever encountered) the “functional” and “anthropic” framings point in the same direction, but here they don’t.
Wei’s motivating example for UDT1.1 is exactly that. It is indeed weird that Eliezer’s FDT paper doesn’t use the idea of optimizing over input-output maps, despite coming out later. But anyway, “folklore” (which is slowly being forgotten it seems) does know the proper way to handle this.
I don’t think the “functional” and “anthropic” approaches are meaningful in that motivating example. There aren’t multiple instances of the same program with the same input.
Do you mean to ask how b should behave on input (n(b), n(b)), and how b should be written to behave on input (n(b), n(b)) for that b?
If x differs from n(b), it might matter in some subtle ways but not straightforwardly how b behaves on (x, x), because that never occurs explicitly in the actual thought experiment (where the first argument is always the code for the program itself). And if the programmer knows x before writing b, and x must be equal to n(b), then since n(-) is bijective, they don’t have any choice about how to write b other than to be the preimage of x under n(-).
Yes. You can assume that the programmer doesn’t know how n works.
Not knowing n(-) results in not knowing the expected utility of b (for any given b), because you won’t know how the terms a(n(a), n(a)) are formed.
(And also the whole being given numeric codes of programs as arguments thing gets weird when you are postulated to be unable to interpret what the codes mean. The point of Newcomblike problems is that you get to reason about behavior of specific agents.)
Basically, you know whether Omega’s program is the same as you or not (assuming you actually are b and not a).