Weird Random Newcomb Problem

Epistemic status: I’m pretty sure the problem is somewhat interesting, because it temporarily confused several smart people. I’m not at all sure that it is very original; probably somebody has already thought about something similar. I’m not at all sure that I have actually found a flaw in UDT, but I somewhat expect that a discussion of this problem may clarify UDT for some people.

This post emerged from my work in the “Deconfusing Commitment Races” project under the Supervised Program for Alignment Research (SPAR), led by James Faville. I’m grateful to SPAR for providing the intellectual environment and to James Faville personally for intellectual discussions and help with the draft of this post. Any mistakes are my own.

I used Claude and Gemini to help me with phrasing and grammar in some parts of this post.

Scenario

Let S be the set of all programs in a fixed programming language that satisfy the following conditions:

  • They take a pair of integers as input.

  • They output either the string “take 1 box” or “take 2 boxes”.

  • They always halt within T steps, where T is some very large constant.

  • They contain no unreachable code.

S is a finite, albeit extremely large, set.

Let n be some fixed bijective numbering of the programs in S.

Omega presents you with a variation of the Newcomb Problem.

  • Omega selects a program a uniformly at random from S.

  • Omega runs a with the input (n(a), n(a)).

  • If a outputs “take 1 box”, Omega puts $1,000,000 in the first box.

  • If a outputs “take 2 boxes”, Omega leaves the first box empty.

  • As in the standard Newcomb Problem, Omega always puts $1,000 in the second box.

However, the program that decides whether you get the money from one or from both boxes is not (necessarily) a. Let’s call this program b. Program b is also an element of S. It receives the pair (n(b), n(a)) as input – its own number according to the numbering n, and the number of the program that Omega randomly selected. Based on this input, b must output either “take 1 box” or “take 2 boxes”.
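To make the setup concrete, here is a minimal sketch in Python. The tiny program set, the numbering, and names like play_round are my own illustration (the real S is astronomically larger); programs are stood in for by Python functions taking (own_number, selected_number) and returning a string.

```python
import random

def one_boxer(own, selected):
    return "take 1 box"

def two_boxer(own, selected):
    return "take 2 boxes"

programs = [one_boxer, two_boxer]                   # toy stand-in for S
numbering = {p: i for i, p in enumerate(programs)}  # a bijective numbering n

def play_round(b, rng):
    """Omega picks a uniformly, fills the first box based on a's output on
    (n(a), n(a)), then the player program b chooses on input (n(b), n(a))."""
    a = rng.choice(programs)
    box1 = 1_000_000 if a(numbering[a], numbering[a]) == "take 1 box" else 0
    box2 = 1_000                                    # always in the second box
    choice = b(numbering[b], numbering[a])
    return box1 if choice == "take 1 box" else box1 + box2

print(play_round(two_boxer, random.Random(0)))      # 1,001,000 or 1,000
```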

Questions

Question 1: Assume you are program b. You want to maximize the money you receive. What should you output if your input is (x, x) (i.e., the two numbers are equal)?

Question 2: Assume you are the programmer writing program b. You want to maximize the expected money program b receives. How should you design b to behave when it receives an input of the form (x, x)?

(Feel free to pause and consider these questions before reading further.)

-

-

-

-

Question 1 appears analogous to the standard Newcomb Problem. Omega ran your own code (acting as a) on the same input you received to determine whether to place the $1,000,000 in the first box. So it seems you should take one box.

But in Question 2 it’s better to write a program which always takes 2 boxes! Consider the programmer choosing between implementing b as one of two specific programs:

  • b_1: Outputs “take 1 box” if the numbers in its input are equal, and “take 2 boxes” otherwise.

  • b_2: Always outputs “take 2 boxes”, regardless of input.

For any fixed choice of a, b_2 gets an additional $1,000 that b_1 doesn’t get when n(a) = n(b_1) (the contents of the first box depend only on a, and in that case b_1 takes only the first box, leaving the $1,000 in the second), and gets the same payoff as b_1 in all other cases. The probability for a to be any specific program is independent of which program you write. So b_2 is strictly better than b_1.
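Here is a toy check of this dominance argument. The three-program “universe” below is again my own stand-in for the real, astronomically large S, but the pattern is the same.

```python
def b1(own, selected):
    # Takes one box exactly when the two numbers in its input are equal.
    return "take 1 box" if own == selected else "take 2 boxes"

def b2(own, selected):
    return "take 2 boxes"

def always_one_box(own, selected):
    return "take 1 box"

programs = [b1, b2, always_one_box]                 # toy stand-in for S
numbering = {p: i for i, p in enumerate(programs)}

def payoff_given_a(b, a):
    """Payoff of player program b, conditional on Omega having selected a."""
    box1 = 1_000_000 if a(numbering[a], numbering[a]) == "take 1 box" else 0
    choice = b(numbering[b], numbering[a])
    return box1 if choice == "take 1 box" else box1 + 1_000

for a in programs:
    print(a.__name__, payoff_given_a(b1, a), payoff_given_a(b2, a))

# For each fixed a:
#   a = b1:             b1 gets 1,000,000   b2 gets 1,001,000
#   a = b2:             b1 gets     1,000   b2 gets     1,000
#   a = always_one_box: b1 gets 1,001,000   b2 gets 1,001,000
# Since a is chosen independently of which program you wrote, b2 does at
# least as well as b1 for every a, and strictly better when a happens to
# be b1.
```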

So, if you are the program, you prefer to choose one action. But if you are the programmer who writes this program, you prefer it to choose another action in the same circumstances.

Isn’t it normal?

At first glance, there are many problems like this. Justifications of advanced decision theories often rely on problems with this property, and their discussion usually ends with something like “...and that’s why you should follow the optimal policy even if you didn’t explicitly precommit to it beforehand”. The argument typically runs in one of the following framings:

“Universal precommitment” framing: You prefer to have an optimal policy. Sometimes the optimal policy includes locally non-optimal decisions (e.g., if someone predicts your policy). So you would like to make a precommitment for such cases. You can’t think about all possible situations in advance, so it’s better to make a general precommitment: “I will follow any precommitment which it would have been a good idea to make in advance.” It would be a good idea to precommit to take one box in the standard Newcomb Problem, to pay the driver in Parfit’s Hitchhiker, and to pay Omega in Counterfactual Mugging. So you do it.

“Functional” framing: You control the input-output behavior of your decision-making function. This function can be instantiated not only in you, but also in other places, e.g., in someone who predicts your behavior. You necessarily control all instances at the same time. You prefer the instances inside predictions to take one box / pay the driver / pay Omega. So you do it.

“Anthropic” framing: You actually don’t know whether you are the “real you” or a simulation of yourself inside an accurate prediction (otherwise it wouldn’t be accurate). So ordinary causal expected utility calculations tell you it’s better to take one box (it’s -$1K if you are the “real you”, but +$1M if you are the simulation, and the chances are 50-50), to pay the driver (-$100 and +life, chances of being a simulation are at least 50%), and to pay Omega (+$500 and -$100, chances of being a simulation are 2 to 1).
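For the Newcomb case, with the 50-50 odds above, the expected gain from one-boxing rather than two-boxing works out to

$$\tfrac{1}{2}\cdot(-\$1{,}000) + \tfrac{1}{2}\cdot(+\$1{,}000{,}000) = +\$499{,}500 > 0,$$

so under this framing one-boxing wins comfortably in expectation.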

No, it’s weird (I think)

As we can see, these framings usually point in the same direction. But not here! When the program receives two equal numbers as input, the advice from these approaches is:

“Universal precommitment”: The optimal policy is to always take two boxes. Do it!

“Functional”: You control yourself and Omega’s program (because both are the same program), and it’s better for you if you both take one box. So leave $1K on the table to get a million!

“Anthropic”: You almost certainly are Omega’s program and not the player’s program. Take one simulated box to put $1M in the real one!

So here the “universal precommitment” approach is in conflict with the other two approaches. And for me personally, the advice to take both boxes here (if you are the program; if you are the programmer, you should clearly write a program that always takes two boxes, no problem there) is much more counterintuitive than the “universal precommitment” advice in any other decision theory problem I know. I think the “universal precommitment” framing is the closest to what UDT actually means, so now my confidence in it has been somewhat shaken.

I think it’s weird. Do you?