A lightning-fast recap of counterfactuals in decision theory: generally, when facing a decision problem, a natural approach is to consider questions of the form “what would happen if I took action a?”. This particular construction has ambiguity in the use of the word “I” to refer to both the real and counterfactual agent, which is the root of the 5-and-10 problem. The Functional Decision Theory of Yudkowsky and Soares (2017) provides a more formal version of this construction, but suffers from (a more explicit version of) the same ambiguity by taking as its counterfactuals logical statements about its own decision function, some of which must necessarily be false. This poses an issue for an embedded agent implementing FDT, which, knowing itself to be part of its environment, might presumably therefore determine the full structure of its own decision function, and subsequently run into trouble when it can derive a contradiction from its counterlogical assumptions.
Soares and Levinstein (2017) explore this issue but defer it to future development of “a non-trivial theory of logical counterfactuals” (p. 1). I’ve been trying to explore a different approach: replacing the core counterfactual question of “what if my algorithm had a different output?” with “what if an agent with a different algorithm were in my situation?”. Yudkowsky and Soares (2017) consider a related approach, but dismiss it:
In attempting to avoid this dependency on counterpossible conditionals, one might suggest a variant FDT′ that asks not “What if my decision function had a different output?” but rather “What if I made my decisions using a different decision function?” When faced with a decision, an FDT′ agent would iterate over functions g from some set G, consider how much utility she would achieve if she implemented that function g instead of her actual decision function f′, and emulate the best g. Her actual decision function f′ is the function that iterates over G, and emulates the best g.
However, by considering the behavior of FDT′ in Newcomb’s problem, we see that it does not save us any trouble. For the predictor predicts the output of f′, and in order to preserve the desired correspondence between predicted actions and predictions, FDT′ cannot simply imagine a world in which she implements g instead of f′; she must imagine a world in which all predictors of f′ predict as if f′ behaved like g—and then we are right back to where we started, with a need for some method of predicting how an algorithm would behave if it (counterpossibly) behaved differently from usual. (p. 7)
I think it should be possible to avoid this objection by eliminating f′ from consideration entirely. By only considering a separate agent that implements g directly, we don’t need to imagine predictors of f′ attempting to act as though f′ were g; we can just imagine predictors of g instead.
More concretely, we can consider an embedded agent A, with a world-model W of type WorldModel (a deliberately-vague container for everything the agent believes about the real world, constructed by building up from sense-data and axiomatic background assumptions by logical inference), and a decision function D of type WorldModel → Action (interpreted as “given my current beliefs about the state of the world, what action should I take?”). Since A is embedded, we can assume W contains (as a background assumption) a description of A, including the world-model W (itself) and the decision function D. For a particular possible action a, we can simultaneously construct a counterfactual world-model W_a and a counterfactual decision function D_a such that D_a maps W_a to a, and otherwise invokes D (with some diagonalisation-adjacent trickery to handle the necessary self-references), and W_a is the result of replacing all instances of D in W with D_a. W_a is a world-model similar to W, but incrementally easier to predict, since instead of the root agent A, it contains a counterfactual agent A_a with world-model W_a and decision function D_a, which by construction immediately takes action a. Given further a utility function U of type WorldModel → ℝ, we can complete a description of D as “return the a that maximizes U(W_a)”.
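As a rough illustration of the shape of this construction (and nothing more), here is a minimal Python sketch. All the names are purely illustrative, and the "counterfactual_for" tag is a cheap stand-in for the diagonalisation-adjacent trickery rather than a solution to it, so treat this as a toy picture of how D_a, W_a, and the final argmax fit together:

```python
from typing import Callable, Dict, List

Action = str
# A world-model here is just a dict of beliefs; the "decision_function" entry
# stands in for the embedded self-description of D inside W.
WorldModel = Dict[str, object]
DecisionFunction = Callable[[WorldModel], Action]


def make_counterfactual_d(action: Action, original_d: DecisionFunction) -> DecisionFunction:
    """Build D_a: take `action` on the counterfactual world-model W_a,
    and defer to the original D on anything else."""
    def d_a(w: WorldModel) -> Action:
        if w.get("counterfactual_for") == action:
            return action           # by construction, immediately take a
        return original_d(w)        # otherwise behave exactly like D
    return d_a


def make_counterfactual_w(w: WorldModel, d_a: DecisionFunction, action: Action) -> WorldModel:
    """Build W_a: a copy of W with every reference to D replaced by D_a."""
    w_a = dict(w)
    w_a["decision_function"] = d_a
    w_a["counterfactual_for"] = action  # toy stand-in for the self-reference machinery
    return w_a


def decide(w: WorldModel,
           d: DecisionFunction,
           actions: List[Action],
           utility: Callable[[WorldModel], float]) -> Action:
    """Complete D: return the a that maximizes U(W_a). (Passing d in as a
    parameter dodges the self-reference the real construction must handle.)"""
    def value(a: Action) -> float:
        d_a = make_counterfactual_d(a, d)
        w_a = make_counterfactual_w(w, d_a, a)
        return utility(w_a)
    return max(actions, key=value)
```

The point this toy is meant to convey is that each d_a is an ordinary, fully-specified function: nothing about its behaviour needs to be assumed counterlogically, it can simply be inspected.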
I have a large number of words of draft-thoughts exploring this idea in more detail, including a proper sketch of how to construct W_a and D_a, and an exploration of how it seems to handle various decision problems. In particular A seems to give an interesting response to Newcomb-like problems (particularly the transparent variant), where it is forced to explicitly consider the possibility that it might be being simulated, and prescribes only a conditional one-boxing in the presence of various background beliefs about why it might be being simulated (and is capable of two-boxing without self-modification if it believes itself to be in a perverse universe where two-boxers predictably do better than one-boxers). But that seems mostly secondary to the intended benefit, which is that none of the reasoning about the counterfactual agents requires counterlogical assumptions—the belief that D_a returns action a needn’t be assumed, but can be inferred by inspection since D_a is a fully defined function that could literally be written down if required.
Before I try to actually turn those words into something postable, I’d like to make sure (as far as I can) that it’s not an obviously flawed idea / hasn’t already been done before / isn’t otherwise uninteresting, by running it past the community here. Does anyone with expertise in this area have any comments on the viability of this approach?
I am somewhat new to this site and to the design problem of counterfactual reasoning in embedded agents. So I can say something about whether I think your approach goes in the right direction, but I can’t really answer your question of whether this has all been done before.
Based on my own reading and recent work so far, if your goal is to create a working “what happens if I do a” reasoning system for an embedded agent, my intuition is that your approach is necessary, and that it is going in the right direction. I will unpack this statement at length further below.
I also get the impression that you are wondering whether doing the proposed work will move forward the discussion about Newcomb’s problem. Not sure if this is really a part of your question, but in any case, I am not in a position to give you any good insights on what is needed to move forward the Newcomb discussion.
Here are my longer thoughts on your approach to building counterfactual reasoning. I need to start by setting the scene. Say we have an embedded agent containing a very advanced predictive world model, a model that includes all details about the agent’s current internal computational structure. Say this agent wants to compute the next action that will best maximize its utility function. The agent can perform this computation in two ways.
1. One way to pick the correct action is to initiate a simulation that runs its entire world model forward, and to observe the copy of itself inside the simulation. The action that its simulated copy picks is the action that it wants. There is an obvious problem with using this approach. If the simulated copy takes the same approach, and the agent it simulates in turn does likewise, and so on, we get infinite recursion and the predictive computation will never end. So at one point in the chain, the (real or n-level simulated) agent has to use a second way to compute the action that maximizes utility.
2. The second way is the argmax what-if method. Say the agent wants to choose between 10 possible actions. It can do this by running 10 simulations that compute ‘the utility of what will happen if I take action a[i]’. Now, the agent knows full well that it will end up choosing only one of these 10 actions in its real future. So only one of these 10 simulations is a simulation of a future that actually contains the agent itself. It is therefore completely inappropriate for the agent to include any statement like ‘this future will contain myself’ inside at least 9 of the 10 world models that it supplies to its simulator. It has no choice but to ‘edit out’ any world model information that could cause the simulation engine to make this inference, or else 9 of the 10 simulations will have the 5-and-10 problem, and their results can no longer be trusted as valid inputs to its argmax calculation.
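To make that editing step concrete, here is a rough sketch of the argmax what-if loop I have in mind. Everything in it (the key names, the simulate routine) is made up for illustration; the only point is that the agent’s self-description is stripped from the world model before each what-if simulation:

```python
from typing import Callable, Dict, List

WorldModel = Dict[str, object]


def edit_out_self(world: WorldModel) -> WorldModel:
    """Strip the facts that would let the simulator infer 'this future
    contains myself', so the what-if runs avoid the 5-and-10 problem."""
    edited = dict(world)
    edited.pop("my_source_code", None)        # illustrative keys only
    edited.pop("my_decision_procedure", None)
    return edited


def choose_action(world: WorldModel,
                  actions: List[str],
                  simulate: Callable[[WorldModel, str], WorldModel],
                  utility: Callable[[WorldModel], float]) -> str:
    """The argmax what-if method: one edited simulation per candidate action,
    then pick the action whose simulated future scores highest."""
    def value(action: str) -> float:
        future = simulate(edit_out_self(world), action)
        return utility(future)
    return max(actions, key=value)
```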
The above reasoning shows that the agent cannot avoid ending up in a place where it has to edit some details about itself out of the simulator input. (Exception: we could assume that the simulation engine somehow does the needed editing itself, just after getting the input, inside its own part of the computation, but this amounts to basically the same thing. I interpret some proposals about updating FDT that I have read in the links as proposals to use such a special simulation engine.)
If the agent must do some editing on the simulator input, we can worry that the agent will do ‘too much’ editing, or the wrong type of editing. I have a theory that this worry is at least partly responsible for keeping the discussions around FDT and counterfactuals so lively.
The obvious worry is that, if the agent edits itself out of the simulation runs, it will no longer be able to reason about itself. In a philosophical sense, can we still say that such an agent possesses something like consciousness, or a robust theory of self? If not, does this mean that such an editing agent can never get to a state of super-intelligence? For me, these philosophical questions are really questions about the meaning of words, so I do not worry about AGI safety too much if these questions remain unresolved. However, editing also implies a more practical worry: how will editing impact the observable behaviour of the agent? If the agent edits itself out of the simulations, does this inevitably mean that it will lose some observable properties that we would expect to find in a highly intelligent agent, properties like showing an emergent drive to protect its own utility function?
When we look at the world models of many existing agents, e.g. the AlphaZero chess playing agent, we can interpret them as being ‘pre-edited’: these models do not contain any direct representation of the agent’s utility function or the hardware it runs on. Therefore, the simulation engine in the agent runs no risk of blowing up, no risk of creating a 5-and-10 contradiction when it combines available pieces of world knowledge. Some of the world knowledge needed to create the contradiction never made it into the world model. However, this type of construction also makes the agent indifferent to protecting its utility function. I cover this topic in more detail in section 5.2 of my paper here. In section 5.3 I introduce an agent that uses a predictive world model that is ‘less edited’: this model makes the utility function of the agent show up inside the simulation runs. The pleasing result is that the agent then gets an emergent drive to preserve its utility function.
A similar agent design, one that turns the editing knob towards greater embeddedness, is in the paper Self-modification of policy and utility function in rational agents. Both these papers get pretty heavy on the math when they start proving agent properties. This may be a price we have to pay for turning the knob. I am currently investigating how to turn the knob even closer to embeddedness. There are some hints that the math might get simpler again in that case, but we’ll see.