Some thoughts about embedded agency.
From a learning-theoretic perspective, we can reformulate the problem of embedded agency as follows: What kind of agent, and under what conditions, can effectively plan for events after its own death? For example, Alice bequeaths eir fortune to eir children, since ey want them to be happy even when Alice emself is no longer alive. Here, “death” can be understood to include modification, since modification effectively destroys an agent and replaces it with a different agent[1]. For example, Clippy 1.0 is an AI that values paperclips. Alice disables Clippy 1.0 and reprograms it to value staples before running it again. Then, Clippy 2.0 can be considered a new, different agent.
First, in order to meaningfully plan for death, the agent’s reward function has to be defined in terms of something other than its direct perceptions. Indeed, by definition the agent no longer perceives anything after death. Instrumental reward functions are somewhat relevant but still don’t give the right object, since the reward is still tied to the agent’s actions and observations. Therefore, we will consider reward functions defined in terms of some fixed ontology of the external world. Formally, such an ontology can be an incomplete[2] Markov chain, with the reward being a function of the state. Examples (a toy code sketch follows them):
The Markov chain is a representation of known physics (or some sector of known physics). The reward corresponds to the total mass of diamond in the world. To make this example work, we only need enough physics to be able to define diamonds. For example, we can make do with quantum electrodynamics + classical gravity and have the Knightian uncertainty account for all nuclear and high-energy phenomena.
The Markov chain is a representation of people and social interactions. The reward corresponds to concepts like “happiness” or “friendship”, et cetera. Everything that falls outside the domain of human interactions is accounted for by Knightian uncertainty.
The Markov chain is Botworld with some of the rules left unspecified. The reward is the total number of a particular type of item.
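To make the setup concrete, here is a minimal Python sketch of an incomplete Markov chain ontology with a state-dependent reward. All the names (`IncompleteMarkovChain`, the item-count states) are made up for illustration; the Knightian uncertainty is represented crudely as a set of admissible transition distributions per state rather than a single one.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List

State = str  # a coarse description of the world in the chosen ontology

@dataclass
class IncompleteMarkovChain:
    """An ontology with Knightian uncertainty: for each state we only know a
    set of admissible next-state distributions, not a single one."""
    states: FrozenSet[State]
    admissible_transitions: Dict[State, List[Dict[State, float]]]
    reward: Callable[[State], float]  # the reward depends on the state only

# Toy Botworld-flavoured instance: the reward is the number of items in the state.
ontology = IncompleteMarkovChain(
    states=frozenset({"0_items", "1_items", "2_items"}),
    admissible_transitions={
        # Item dynamics are left partly unspecified: both "stay" and
        # "gain an item" are admissible, and the ontology does not say which.
        "0_items": [{"0_items": 1.0}, {"1_items": 1.0}],
        "1_items": [{"1_items": 1.0}, {"2_items": 1.0}],
        "2_items": [{"2_items": 1.0}],
    },
    reward=lambda s: int(s.split("_")[0]),
)
```

A hypothesis, in the sense used below, is then any way of shrinking these admissible sets.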
Now we need to somehow connect the agent to the ontology. Essentially, we need a way of drawing Cartesian boundaries inside the (a priori non-Cartesian) world. We can accomplish this by specifying a function that assigns an observation and a projected action to every state in some subset of states. Entering this subset corresponds to agent creation, and leaving it corresponds to agent destruction. For example, we can take the ontology to be Botworld + a marked robot, and take the observations and actions to be the observations and actions of that robot. If we don’t want the marking of a particular robot to be part of the ontology, we can use a more complicated definition of Cartesian boundary that specifies a set of agents at each state, plus the data needed to track these agents across time (in this case, the observation and action depend to some extent on the history and not only on the current state). I will leave out the details for now.
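As a rough illustration of the simpler kind of Cartesian boundary (the one that marks a single robot rather than tracking a set of agents across time), consider the following sketch; the state labels and the `collect` action are hypothetical:

```python
from typing import Optional, Tuple

State = str
Observation = str
Action = str

# A simple Cartesian boundary: a partial map from ontology states to the marked
# agent's (observation, projected action). States outside its domain are states
# in which the agent does not exist.
def boundary(state: State) -> Optional[Tuple[Observation, Action]]:
    if state in ("not_yet_built", "destroyed"):
        return None  # the agent is not instantiated in these states
    return (f"see_{state}", "collect")

def is_alive(state: State) -> bool:
    return boundary(state) is not None

# Agent creation / destruction are just entry into / exit from the domain:
def life_events(trajectory):
    events = []
    for prev, nxt in zip(trajectory, trajectory[1:]):
        if not is_alive(prev) and is_alive(nxt):
            events.append(("created", nxt))
        elif is_alive(prev) and not is_alive(nxt):
            events.append(("destroyed", nxt))
    return events

print(life_events(["not_yet_built", "0_items", "1_items", "destroyed"]))
# [('created', '0_items'), ('destroyed', 'destroyed')]
```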
Finally, we need to define the prior. To do this, we start by choosing some prior over refinements of the ontology. By “refinement”, I mean removing part of the Knightian uncertainty, i.e. considering incomplete hypotheses which are subsets of the “ontological belief”. For example, if the ontology is underspecified Botworld, the hypotheses will specify some of what was left underspecified. Given such an “objective” prior and a Cartesian boundary, we can construct a “subjective” prior for the corresponding agent. We transform each hypothesis by postulating that taking an action that differs from the projected action leads to a “Nirvana” state. Alternatively, we can allow for stochastic action selection and use the gambler construction.
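Here is a crude sketch of the Nirvana transform, assuming the usual convention that Nirvana is an absorbing state of maximal reward, so that a hypothesis places no constraint on what happens when the agent deviates from its projected action; the function names are illustrative only:

```python
NIRVANA = "nirvana"

def subjective_transition(objective_transition, projected_action, state, chosen_action):
    """Subjective next-state distribution derived from an objective hypothesis.

    objective_transition(state) -> next-state distribution of the hypothesis
    projected_action(state)    -> the action the Cartesian boundary assigns to state
    """
    if chosen_action != projected_action(state):
        # The hypothesis makes no prediction off the projected action.
        return {NIRVANA: 1.0}
    return objective_transition(state)

def subjective_reward(objective_reward, state):
    # Nirvana carries maximal reward, so these counterfactual branches never
    # penalize a hypothesis.
    return float("inf") if state == NIRVANA else objective_reward(state)
```

A prior over refinements is then just a weighting over hypotheses whose admissible transition sets are contained in the ontology’s, each pushed through this transform.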
Does this framework guarantee effective planning for death? A positive answer would correspond to some kind of learnability result (regret bound). To get learnability, we will first need the reward to be either directly or indirectly observable. By “indirectly observable” I mean something like with semi-instrumental reward functions, but accounting for agent mortality. I am not ready to formulate the precise condition at the moment. Second, we need to consider an asymptotic in which the agent is long-lived (in addition to the time discount being long-term), otherwise it won’t have enough time to learn. Third (this is the trickiest part), we need the Cartesian boundary to flow with the asymptotic as well, making the agent “unspecial”. For example, consider Botworld with some kind of simplicity prior. If I am a robot born at cell zero and time zero, then my death is an event of low description complexity. It is impossible to be confident about what happens after such a simple event, since there will always be competing hypotheses with different predictions and a probability that is only lower by a factor of Ω(1). On the other hand, if I am a robot born at cell 2439495 at time 9653302, then it would be surprising if the outcome of my death were qualitatively different from the outcome of the death of any other robot I observed. Finding some natural, rigorous and general way to formalize this condition is a very interesting problem. Of course, even without learnability we can strive for Bayes-optimality or some approximation thereof. But it is still important to prove learnability under certain conditions, to test that this framework truly models rational reasoning about death.
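As a very crude numerical illustration of the “unspecialness” point, assuming a simple self-delimiting encoding of the birth coordinates (the exact code is unimportant; only the qualitative gap between simple and generic coordinates matters):

```python
import math

def birth_description_bits(cell: int, time: int) -> int:
    """Crude description length (in bits) of a birth event: each coordinate is
    encoded with a length prefix so the two integers are self-delimiting."""
    def bits(n: int) -> int:
        return 1 if n == 0 else int(math.log2(n)) + 1
    return sum(bits(n) + 2 * bits(bits(n)) for n in (cell, time))

print(birth_description_bits(0, 0))              # a handful of bits
print(birth_description_bits(2439495, 9653302))  # dozens of bits

# Under a simplicity prior, a hypothesis that singles out the *simple* birth
# event pays only a 2^-(a few bits) penalty, i.e. an Omega(1) factor, so it can
# never be confidently ruled out; singling out the generic event is far costlier.
```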
Additionally, there is an intriguing connection between some of these ideas and UDT, if we consider TRL agents. Specifically, a TRL agent can have a reward function that is defined in terms of computations, exactly like UDT is often conceived. For example, we can consider an agent whose reward is defined in terms of a simulation of Botworld, or in terms of taking expected value over a simplicity prior over many versions of Botworld. Such an agent would be searching for copies of itself inside the computations it cares about, which may also be regarded as a form of “embeddedness”. It seems like this can be naturally considered a special case of the previous construction, if we allow the “ontological belief” to include beliefs pertaining to computations.
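The TRL/UDT connection is hard to compress into a few lines, but the “expected value over a simplicity prior over many versions of Botworld” part of such a reward can be sketched as follows; the versions and their description lengths are entirely made up:

```python
# Toy stand-in for a reward "defined in terms of computations": an expected item
# count over a simplicity prior on several hypothetical versions of a simulated world.
def run_version(params, steps):
    """Hypothetical stand-in for running one version of the simulation for
    `steps` steps and counting items of the relevant type."""
    growth, start = params
    return start + growth * steps

versions = [
    # (description length in bits, parameters of that version)
    (3, (1, 0)),
    (5, (2, 1)),
    (8, (0, 4)),
]

def computational_reward(steps):
    # Weight each version by 2^-(description length) and normalize.
    weights = [2.0 ** -k for k, _ in versions]
    total = sum(weights)
    return sum(w * run_version(p, steps) for w, (_, p) in zip(weights, versions)) / total

print(computational_reward(10))
```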
[1] Unless it’s some kind of modification that we treat explicitly in our model of the agent, for example a TRL agent reprogramming its own envelope.
[2] “Incomplete” in the sense of Knightian uncertainty, like in quasi-Bayesian RL.