Ah, okay, I see what you mean. Like how preferences are divisible into “selfish” and “worldly” components, where the selfish component is what’s impacted by a future simulation of you that is about to have good things happen to it.
(edit: The reward function in AMDPs can either be analogous to “wordly” and just sum the reward calculated at individual timesteps, or analogous to “selfish” and calculated by taking the limit of the subjective distribution over parts of the history, then applying a reward function to the expected histories.)
I brought up the histories->states thing because I didn’t understand what you were getting at, so I was concerned that something unrealistic was going on. For example, if you assume that the agent can remember its history, how can you possibly handle an environment with memory-wiping?
In fact, to me the example is still somewhat murky, because you’re talking about the subjective probability of a state given a policy and a timestep, but if the agents know their histories there is no actual agent in the information-state that corresponds to having those probabilities. In an MDP the agents just have probabilities over transitions—so maybe a clearer example is an agent that copies itself if it wins the lottery having a larger subjective transition probability of going from gambling to winning. (i.e. states are losing and winning, actions are gamble and copy, the policy is to gamble until you win and then copy).
Ah, okay, I see what you mean. Like how preferences are divisible into “selfish” and “worldly” components, where the selfish component is what’s impacted by a future simulation of you that is about to have good things happen to it.
...I brought up the histories->states thing because I didn’t understand what you were getting at, so I was concerned that something unrealistic was going on. For example, if you assume that the agent can remember its history, how can you possibly handle an environment with memory-wiping?
AMDP is only a toy model that distills the core difficulty into more or less the simplest non-trivial framework. The rewards are “selfish”: there is a reward function r:(S×A)∗→R which allows assigning utilities to histories by time discounted summation, and we consider the expected utility of a random robot sampled from a late population. And, there is no memory wiping. To describe memory wiping we indeed need to do the “unrolling” you suggested. (Notice that from the cybernetic model POV, the history is only the remembered history.)
For a more complete framework, we can use an ontology chain, but (i) instead of A×O labels use A×M labels, where M is the set of possible memory states (a policy is then described by π:M→A), to allow for agents that don’t fully trust their memory (ii) consider another chain with a bigger state space S′ plus a mapping p:S′→NS s.t. the transition kernels are compatible. Here, the semantics of p(s) is: the multiset of ontological states resulting from interpreting the physical state s by taking the viewpoints of different agents s contains.
In fact, to me the example is still somewhat murky, because you’re talking about the subjective probability of a state given a policy and a timestep, but if the agents know their histories there is no actual agent in the information-state that corresponds to having those probabilities.
I didn’t understand “no actual agent in the information-state that corresponds to having those probabilities”. What does it mean to have an agent in the information-state?
Ah, okay, I see what you mean. Like how preferences are divisible into “selfish” and “worldly” components, where the selfish component is what’s impacted by a future simulation of you that is about to have good things happen to it.
(edit: The reward function in AMDPs can either be analogous to “wordly” and just sum the reward calculated at individual timesteps, or analogous to “selfish” and calculated by taking the limit of the subjective distribution over parts of the history, then applying a reward function to the expected histories.)
I brought up the histories->states thing because I didn’t understand what you were getting at, so I was concerned that something unrealistic was going on. For example, if you assume that the agent can remember its history, how can you possibly handle an environment with memory-wiping?
In fact, to me the example is still somewhat murky, because you’re talking about the subjective probability of a state given a policy and a timestep, but if the agents know their histories there is no actual agent in the information-state that corresponds to having those probabilities. In an MDP the agents just have probabilities over transitions—so maybe a clearer example is an agent that copies itself if it wins the lottery having a larger subjective transition probability of going from gambling to winning. (i.e. states are losing and winning, actions are gamble and copy, the policy is to gamble until you win and then copy).
AMDP is only a toy model that distills the core difficulty into more or less the simplest non-trivial framework. The rewards are “selfish”: there is a reward function r:(S×A)∗→R which allows assigning utilities to histories by time discounted summation, and we consider the expected utility of a random robot sampled from a late population. And, there is no memory wiping. To describe memory wiping we indeed need to do the “unrolling” you suggested. (Notice that from the cybernetic model POV, the history is only the remembered history.)
For a more complete framework, we can use an ontology chain, but (i) instead of A×O labels use A×M labels, where M is the set of possible memory states (a policy is then described by π:M→A), to allow for agents that don’t fully trust their memory (ii) consider another chain with a bigger state space S′ plus a mapping p:S′→NS s.t. the transition kernels are compatible. Here, the semantics of p(s) is: the multiset of ontological states resulting from interpreting the physical state s by taking the viewpoints of different agents s contains.
I didn’t understand “no actual agent in the information-state that corresponds to having those probabilities”. What does it mean to have an agent in the information-state?
Nevermind, I think I was just looking at it with the wrong class of reward function in mind.