Could you expand a little on why you say that no Bayesian hypothesis captures the distribution over robot-histories at different times? It seems like you can unroll an AMDP into a “memory MDP” that puts memory information of the robot into the state, thus allowing Bayesian calculation of the distribution over states in the memory MDP to capture history information in the AMDP.
I’m not sure what you mean by that “unrolling”. Can you write a mathematical definition?
Let’s consider a simple example. There are two states, $s_0$ and $s_1$. There is just one action, so we can ignore it. $s_0$ is the initial state. An $s_0$ robot transitions into an $s_1$ robot. An $s_1$ robot transitions into an $s_0$ robot and an $s_1$ robot. What will our population look like?
0th step: all robots remember $s_0$
1st step: all robots remember $s_0s_1$
2nd step: 1/2 of robots remember $s_0s_1s_0$ and 1/2 of robots remember $s_0s_1s_1$
3rd step: 1/3 of robots remember $s_0s_1s_0s_1$, 1/3 of robots remember $s_0s_1s_1s_0$ and 1/3 of robots remember $s_0s_1s_1s_1$
There is no Bayesian hypothesis a robot can have that gives correct predictions both for step 2 and step 3. Indeed, to be consistent with step 2 we must have $\Pr[s_0s_1s_0]=\frac{1}{2}$ and $\Pr[s_0s_1s_1]=\frac{1}{2}$. But, to be consistent with step 3 we must have $\Pr[s_0s_1s_0]=\frac{1}{3}$, $\Pr[s_0s_1s_1]=\frac{2}{3}$.
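(Here is a quick sketch to double-check this arithmetic by enumerating the population of remembered histories directly; the encoding of histories as strings over {'0','1'} is ad hoc, just for the check.)

```python
# Enumerate the population of remembered histories for the example above.
from collections import Counter

def step(population):
    """One anthropic step: an s0 robot becomes an s1 robot; an s1 robot becomes
    one s0 robot and one s1 robot (each child inherits the parent's history)."""
    new_population = []
    for history in population:
        if history[-1] == '0':
            new_population.append(history + '1')
        else:
            new_population.append(history + '0')
            new_population.append(history + '1')
    return new_population

population = ['0']  # step 0: every robot remembers s0
for t in range(1, 4):
    population = step(population)
    counts = Counter(population)
    print(f"step {t}:", {h: f"{c}/{len(population)}" for h, c in sorted(counts.items())})
# step 2: {'010': '1/2', '011': '1/2'}; step 3: {'0101': '1/3', '0110': '1/3', '0111': '1/3'}
```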
In other words, there is no Bayesian hypothesis s.t. we can guarantee that a randomly sampled robot on a sufficiently late time step will have learned this hypothesis with high probability. The apparent transition probabilities keep shifting s.t. it might always continue to seem that the world is complicated enough to prevent our robot from having learned it already.
Or, at least it’s not obvious there is such a hypothesis. In this example, $\frac{\Pr[s_0s_1s_1]}{\Pr[s_0s_1s_0]}$ will converge to the golden ratio at late steps. But, do all probabilities converge fast enough for learning to happen, in general? I don’t know, maybe for finite state spaces it can work. Would definitely be interesting to check.
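(Where the golden ratio comes from: the counts of robots in $(s_0,s_1)$ evolve by $(a,b)\to(b,a+b)$, a Fibonacci-type recursion, so the descendants of the step-2 robot that was in $s_1$ outnumber those of the one in $s_0$ by a factor approaching the golden ratio. A quick check, again with my own ad-hoc encoding:)

```python
# Ratio of descendants of the two step-2 robots under (a, b) -> (b, a + b).
def total_descendants(a, b, steps):
    """a, b = current numbers of descendants in s0, s1; advance `steps` steps."""
    for _ in range(steps):
        a, b = b, a + b
    return a + b

for k in range(1, 16):
    ratio = total_descendants(0, 1, k) / total_descendants(1, 0, k)
    print(k, ratio)
# The printed ratio approaches (1 + 5**0.5) / 2 ~ 1.618.
```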
[EDIT: actually, in this example there is such a hypothesis but in general there isn’t, see below]
Great example. At least for the purposes of explaining what I mean :) The memory AMDP would just replace the states $s_0$, $s_1$ with the memory states $[s_0]$, $[s_1]$, $[s_0,s_0]$, $[s_0,s_1]$, etc. The action takes a robot in $[s_0]$ to memory state $[s_0,s_1]$, and a robot in $[s_0,s_1]$ to one robot in $[s_0,s_1,s_0]$ and another in $[s_0,s_1,s_1]$.
(Skip this paragraph unless the specifics of what’s going on aren’t obvious: given a transition distribution $P(s'^* \mid s, \pi)$ ($P$ being the distribution over sets of states $s'^*$ given starting state $s$ and policy $\pi$), we can define the memory transition distribution $P(s'^*_m \mid s_m, \pi)$ given policy $\pi$ and starting “memory state” $s_m \in S^*$ (note that this star actually does mean finite sequences, sorry for the notational ugliness). First we plug the last element of $s_m$ into the transition distribution as the current state. Then for each $s'^*$ in the domain, for each element of $s'^*$ we concatenate that element onto the end of $s_m$ and collect these $s'_m$ into a set $s'^*_m$, which is assigned the same probability $P(s'^*)$.)
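(A minimal sketch of this construction, with a transition distribution encoded ad hoc as a dict from state to (successor-tuple, probability) pairs and memory states as tuples of states:)

```python
# Lift a base anthropic transition distribution to memory states.
def memory_transition(transition, memory_state):
    """transition: dict state -> list of (successor_tuple, prob);
    memory_state: tuple of states, whose last element plays the role of the current state."""
    current = memory_state[-1]
    lifted = []
    for successors, prob in transition[current]:
        extended = tuple(memory_state + (s,) for s in successors)
        lifted.append((extended, prob))
    return lifted

# The example above: s0 -> one s1 robot; s1 -> one s0 robot and one s1 robot.
example = {'s0': [(('s1',), 1.0)],
           's1': [(('s0', 's1'), 1.0)]}
print(memory_transition(example, ('s0', 's1')))
# [((('s0', 's1', 's0'), ('s0', 's1', 's1')), 1.0)]
```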
So now at time $t=2$, if you sample a robot, the probability that its state begins with $[s_0,s_1,s_1]$ is 1/2. And at time $t=3$, if you sample a robot, that probability changes to 2/3. This is the same result as for the regular AMDP; it’s just that we’ve turned a question about the history of agents, which may be ill-defined, into a question about which states agents are in.
I’m still confused about what you mean by “Bayesian hypothesis” though. Do you mean a hypothesis that takes the form of a non-anthropic MDP?
I’m not quite sure what you are trying to say here; probably my explanation of the framework was lacking. The robots already remember the history, like in classical RL. The question about the histories is perfectly well-defined. In other words, we are already implicitly doing what you described. It’s like in classical RL theory: when you’re proving a regret bound or whatever, your probability space consists of histories.
I’m still confused about what you mean by “Bayesian hypothesis” though. Do you mean a hypothesis that takes the form of a non-anthropic MDP?
Yes, or a classical RL environment. Ofc if we allow infinite state spaces, then any environment can be regarded as an MDP (whose states are histories). That is, I’m talking about hypotheses which conform to the classical “cybernetic agent model”. If you wish, we can call it “Bayesian cybernetic hypothesis”.
Also, I want to clarify something I was myself confused about in the previous comment. For an anthropic Markov chain (when there is only one action) with a finite number of states, we can give a Bayesian cybernetic description, but for a general anthropic MDP we cannot even if the number of states is finite.
Indeed, consider some $T : S \to \Delta\mathbb{N}^S$. We can take its expected value to get $ET : S \to \mathbb{R}_+^S$. Assuming the chain is communicating, $ET$ is an irreducible non-negative matrix, so by the Perron-Frobenius theorem it has a unique-up-to-scalar maximal eigenvector $\eta \in \mathbb{R}_+^S$. We then get the subjective transition kernel:
$$ST(t \mid s) = \frac{ET(t \mid s)\,\eta_t}{\sum_{t' \in S} ET(t' \mid s)\,\eta_{t'}}$$
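(For the one-action chain from the earlier example, where $s_0$ produces one $s_1$ robot and $s_1$ produces one $s_0$ and one $s_1$ robot, this can be checked numerically; a minimal sketch using numpy, reading $\eta$ as the right Perron eigenvector of the matrix with entries $ET(t \mid s)$:)

```python
import numpy as np

# ET[s, t] = expected number of t-robots produced by one s-robot.
ET = np.array([[0.0, 1.0],    # row s0: one s1 robot
               [1.0, 1.0]])   # row s1: one s0 robot and one s1 robot

eigvals, eigvecs = np.linalg.eig(ET)
eta = np.abs(eigvecs[:, np.argmax(eigvals.real)].real)  # Perron eigenvector

ST = ET * eta[None, :]                   # reweight column t by eta_t
ST = ST / ST.sum(axis=1, keepdims=True)  # normalize each row
print(ST)
# rows ~ [[0, 1], [0.382, 0.618]], i.e. ST(s1|s1)/ST(s0|s1) is the golden ratio,
# matching the limiting history ratio from the earlier example.
```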
Now, consider the following example of an AMDP. There are three actions $A := \{a, b, c\}$ and two states $S := \{s_0, s_1\}$. When we apply $a$ to an $s_0$ robot, it creates two $s_0$ robots, whereas when we apply $a$ to an $s_1$ robot, it leaves one $s_1$ robot. When we apply $b$ to an $s_1$ robot, it creates two $s_1$ robots, whereas when we apply $b$ to an $s_0$ robot, it leaves one $s_0$ robot. When we apply $c$ to any robot, it results in one robot whose state is $s_0$ with probability $\frac{1}{2}$ and $s_1$ with probability $\frac{1}{2}$.
Consider the following two policies. $\pi_a$ takes the sequence of actions $cacaca\ldots$ and $\pi_b$ takes the sequence of actions $cbcbcb\ldots$. A population that follows $\pi_a$ would experience the subjective probability $ST(s_0 \mid s_0, c) = \frac{2}{3}$, whereas a population that follows $\pi_b$ would experience the subjective probability $ST(s_0 \mid s_0, c) = \frac{1}{3}$. Hence, subjective probabilities depend on future actions. So, effectively anthropics produces an acausal (Newcomb-like) environment. And, we already know such environments are learnable by infra-Bayesian RL agents and (most probably) not learnable by Bayesian RL agents.
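(One way to check the 2/3 vs. 1/3 numbers is to track expected robot counts through repeated c-then-duplicate cycles; here is a sketch of that bookkeeping, with helper names like apply_dup introduced just for this sketch:)

```python
# Each population entry maps (current state, remembered outcome of the last c)
# to an expected robot count.
from collections import defaultdict

def apply_c(pop):
    """Action c: each robot moves to s0 or s1 with probability 1/2 and remembers it."""
    new = defaultdict(float)
    for (_state, _mem), n in pop.items():
        new[('s0', 's0')] += n / 2
        new[('s1', 's1')] += n / 2
    return new

def apply_dup(pop, doubled_state):
    """Action a doubles s0 robots (doubled_state='s0'); action b doubles s1 robots."""
    new = defaultdict(float)
    for (state, mem), n in pop.items():
        new[(state, mem)] += 2 * n if state == doubled_state else n
    return new

for doubled, name in [('s0', 'pi_a'), ('s1', 'pi_b')]:
    pop = {('s0', None): 1.0}       # start with one s0 robot, no c-outcome remembered
    for _ in range(5):              # five c-then-a (resp. c-then-b) cycles
        pop = apply_dup(apply_c(pop), doubled)
    total = sum(pop.values())
    frac = sum(n for (_s, mem), n in pop.items() if mem == 's0') / total
    print(name, 'subjective probability that c led to s0:', round(frac, 3))
# pi_a gives 0.667 and pi_b gives 0.333, as claimed.
```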
Ah, okay, I see what you mean. Like how preferences are divisible into “selfish” and “worldly” components, where the selfish component is what’s impacted by a future simulation of you that is about to have good things happen to it.
(edit: The reward function in AMDPs can either be analogous to “worldly” and just sum the reward calculated at individual timesteps, or analogous to “selfish” and calculated by taking the limit of the subjective distribution over parts of the history, then applying a reward function to the expected histories.)
I brought up the histories->states thing because I didn’t understand what you were getting at, so I was concerned that something unrealistic was going on. For example, if you assume that the agent can remember its history, how can you possibly handle an environment with memory-wiping?
In fact, to me the example is still somewhat murky, because you’re talking about the subjective probability of a state given a policy and a timestep, but if the agents know their histories there is no actual agent in the information-state that corresponds to having those probabilities. In an MDP the agents just have probabilities over transitions—so maybe a clearer example is an agent that copies itself if it wins the lottery having a larger subjective transition probability of going from gambling to winning. (i.e. states are losing and winning, actions are gamble and copy, the policy is to gamble until you win and then copy).
Ah, okay, I see what you mean. Like how preferences are divisible into “selfish” and “worldly” components, where the selfish component is what’s impacted by a future simulation of you that is about to have good things happen to it.
...I brought up the histories->states thing because I didn’t understand what you were getting at, so I was concerned that something unrealistic was going on. For example, if you assume that the agent can remember its history, how can you possibly handle an environment with memory-wiping?
The AMDP is only a toy model that distills the core difficulty into more or less the simplest non-trivial framework. The rewards are “selfish”: there is a reward function $r : (S \times A)^* \to \mathbb{R}$ which allows assigning utilities to histories by time-discounted summation, and we consider the expected utility of a random robot sampled from a late population. And there is no memory wiping. To describe memory wiping we indeed need to do the “unrolling” you suggested. (Notice that from the cybernetic model POV, the history is only the remembered history.)
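(Spelled out, assuming the natural reading with a discount factor $\gamma$ that is left implicit here: a robot whose remembered history is $h$, with prefixes $h_{:t}$, receives utility
$$U(h) = \sum_{t} \gamma^t\, r(h_{:t}),$$
and the quantity of interest is the expectation of $U(h)$ over a robot sampled uniformly from a late population.)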
For a more complete framework, we can use an ontology chain, but (i) instead of $A \times O$ labels, use $A \times M$ labels, where $M$ is the set of possible memory states (a policy is then described by $\pi : M \to A$), to allow for agents that don’t fully trust their memory, and (ii) consider another chain with a bigger state space $S'$ plus a mapping $p : S' \to \mathbb{N}^S$ s.t. the transition kernels are compatible. Here, the semantics of $p(s)$ is: the multiset of ontological states resulting from interpreting the physical state $s$ by taking the viewpoints of the different agents $s$ contains.
In fact, to me the example is still somewhat murky, because you’re talking about the subjective probability of a state given a policy and a timestep, but if the agents know their histories there is no actual agent in the information-state that corresponds to having those probabilities.
I didn’t understand “no actual agent in the information-state that corresponds to having those probabilities”. What does it mean to have an agent in the information-state?
Nevermind, I think I was just looking at it with the wrong class of reward function in mind.