Could you expand a little on why you say that no Bayesian hypothesis captures the distribution over robot-histories at different times? It seems like you can unroll an AMDP into a “memory MDP” that puts memory information of the robot into the state, thus allowing Bayesian calculation of the distribution over states in the memory MDP to capture history information in the AMDP.
I’m not sure what you mean by that “unrolling”. Can you write a mathematical definition?
Let’s consider a simple example. There are two states, $s_0$ and $s_1$. There is just one action, so we can ignore it. $s_0$ is the initial state. An $s_0$ robot transitions into an $s_1$ robot. An $s_1$ robot transitions into an $s_0$ robot and an $s_1$ robot. What will our population look like?
0th step: all robots remember $s_0$
1st step: all robots remember $s_0s_1$
2nd step: 1/2 of robots remember $s_0s_1s_0$ and 1/2 of robots remember $s_0s_1s_1$
3rd step: 1/3 of robots remember $s_0s_1s_0s_1$, 1/3 of robots remember $s_0s_1s_1s_0$ and 1/3 of robots remember $s_0s_1s_1s_1$
There is no Bayesian hypothesis a robot can have that gives correct predictions both for step 2 and step 3. Indeed, to be consistent with step 2 we must have $\Pr[s_0s_1s_0]=\frac{1}{2}$ and $\Pr[s_0s_1s_1]=\frac{1}{2}$. But, to be consistent with step 3 we must have $\Pr[s_0s_1s_0]=\frac{1}{3}$, $\Pr[s_0s_1s_1]=\frac{2}{3}$.
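(Here is a quick sketch to double-check this arithmetic by enumerating the population of remembered histories directly; the encoding of histories as strings over {'0','1'} is ad hoc, just for the check.)

```python
# Enumerate the population of remembered histories for the example above.
from collections import Counter

def step(population):
    """One anthropic step: an s0 robot becomes an s1 robot; an s1 robot becomes
    one s0 robot and one s1 robot (each child inherits the parent's history)."""
    new_population = []
    for history in population:
        if history[-1] == '0':
            new_population.append(history + '1')
        else:
            new_population.append(history + '0')
            new_population.append(history + '1')
    return new_population

population = ['0']  # step 0: every robot remembers s0
for t in range(1, 4):
    population = step(population)
    counts = Counter(population)
    print(f"step {t}:", {h: f"{c}/{len(population)}" for h, c in sorted(counts.items())})
# step 2: {'010': '1/2', '011': '1/2'}; step 3: {'0101': '1/3', '0110': '1/3', '0111': '1/3'}
```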
In other words, there is no Bayesian hypothesis s.t. we can guarantee that a randomly sampled robot on a sufficiently late time step will have learned this hypothesis with high probability. The apparent transition probabilities keep shifting s.t. it might always continue to seem that the world is complicated enough to prevent our robot from having learned it already.
Or, at least it’s not obvious there is such a hypothesis. In this example, $\frac{\Pr[s_0s_1s_1]}{\Pr[s_0s_1s_0]}$ will converge to the golden ratio at late steps. But, do all probabilities converge fast enough for learning to happen, in general? I don’t know, maybe for finite state spaces it can work. Would definitely be interesting to check.
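(Where the golden ratio comes from: the counts of robots in $(s_0,s_1)$ evolve by $(a,b)\to(b,a+b)$, a Fibonacci-type recursion, so the descendants of the step-2 robot that was in $s_1$ outnumber those of the one in $s_0$ by a factor approaching the golden ratio. A quick check, again with my own ad-hoc encoding:)

```python
# Ratio of descendants of the two step-2 robots under (a, b) -> (b, a + b).
def total_descendants(a, b, steps):
    """a, b = current numbers of descendants in s0, s1; advance `steps` steps."""
    for _ in range(steps):
        a, b = b, a + b
    return a + b

for k in range(1, 16):
    ratio = total_descendants(0, 1, k) / total_descendants(1, 0, k)
    print(k, ratio)
# The printed ratio approaches (1 + 5**0.5) / 2 ~ 1.618.
```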
[EDIT: actually, in this example there is such a hypothesis but in general there isn’t, see below]
Great example. At least for the purposes of explaining what I mean :) The memory AMDP would just replace the states $s_0$, $s_1$ with the memory states $[s_0]$, $[s_1]$, $[s_0,s_0]$, $[s_0,s_1]$, etc. The action takes a robot in $[s_0]$ to memory state $[s_0,s_1]$, and a robot in $[s_0,s_1]$ to one robot in $[s_0,s_1,s_0]$ and another in $[s_0,s_1,s_1]$.
(Skip this paragraph unless the specifics of what’s going on aren’t obvious: given a transition distribution $P(s'^* \mid s, \pi)$ ($P$ being the distribution over sets of states $s'^*$ given starting state $s$ and policy $\pi$), we can define the memory transition distribution $P(s'^*_m \mid s_m, \pi)$ given policy $\pi$ and starting “memory state” $s_m \in S^*$ (note that this star actually does mean finite sequences, sorry for the notational ugliness). First we plug the last element of $s_m$ into the transition distribution as the current state. Then for each $s'^*$ in the domain, for each element of $s'^*$ we concatenate that element onto the end of $s_m$ and collect these $s'_m$ into a set $s'^*_m$, which is assigned the same probability $P(s'^*)$.)
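(A minimal sketch of this construction, with a transition distribution encoded ad hoc as a dict from state to (successor-tuple, probability) pairs and memory states as tuples of states:)

```python
# Lift a base anthropic transition distribution to memory states.
def memory_transition(transition, memory_state):
    """transition: dict state -> list of (successor_tuple, prob);
    memory_state: tuple of states, whose last element plays the role of the current state."""
    current = memory_state[-1]
    lifted = []
    for successors, prob in transition[current]:
        extended = tuple(memory_state + (s,) for s in successors)
        lifted.append((extended, prob))
    return lifted

# The example above: s0 -> one s1 robot; s1 -> one s0 robot and one s1 robot.
example = {'s0': [(('s1',), 1.0)],
           's1': [(('s0', 's1'), 1.0)]}
print(memory_transition(example, ('s0', 's1')))
# [((('s0', 's1', 's0'), ('s0', 's1', 's1')), 1.0)]
```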
So now at time $t=2$, if you sample a robot, the probability that its state begins with $[s_0,s_1,s_1]$ is 1/2. And at time $t=3$, if you sample a robot, that probability changes to 2/3. This is the same result as for the regular AMDP; it’s just that we’ve turned a question about the history of agents, which may be ill-defined, into a question about which states agents are in.
I’m still confused about what you mean by “Bayesian hypothesis” though. Do you mean a hypothesis that takes the form of a non-anthropic MDP?
I’m not quite sure what you are trying to say here; probably my explanation of the framework was lacking. The robots already remember the history, like in classical RL. The question about the histories is perfectly well-defined. In other words, we are already implicitly doing what you described. It’s like in classical RL theory: when you’re proving a regret bound or whatever, your probability space consists of histories.
I’m still confused about what you mean by “Bayesian hypothesis” though. Do you mean a hypothesis that takes the form of a non-anthropic MDP?
Yes, or a classical RL environment. Ofc if we allow infinite state spaces, then any environment can be regarded as an MDP (whose states are histories). That is, I’m talking about hypotheses which conform to the classical “cybernetic agent model”. If you wish, we can call it “Bayesian cybernetic hypothesis”.
Also, I want to clarify something I was myself confused about in the previous comment. For an anthropic Markov chain (when there is only one action) with a finite number of states, we can give a Bayesian cybernetic description, but for a general anthropic MDP we cannot even if the number of states is finite.
Indeed, consider some $T : S \to \Delta\mathbb{N}^S$. We can take its expected value to get $ET : S \to \mathbb{R}_+^S$. Assuming the chain is communicating, $ET$ is an irreducible non-negative matrix, so by the Perron-Frobenius theorem it has a unique-up-to-scalar maximal eigenvector $\eta \in \mathbb{R}_+^S$. We then get the subjective transition kernel:
$$ST(t \mid s) = \frac{ET(t \mid s)\,\eta_t}{\sum_{t' \in S} ET(t' \mid s)\,\eta_{t'}}$$
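(For the one-action chain from the earlier example, where $s_0$ produces one $s_1$ robot and $s_1$ produces one $s_0$ and one $s_1$ robot, this can be checked numerically; a minimal sketch using numpy, reading $\eta$ as the right Perron eigenvector of the matrix with entries $ET(t \mid s)$:)

```python
import numpy as np

# ET[s, t] = expected number of t-robots produced by one s-robot.
ET = np.array([[0.0, 1.0],    # row s0: one s1 robot
               [1.0, 1.0]])   # row s1: one s0 robot and one s1 robot

eigvals, eigvecs = np.linalg.eig(ET)
eta = np.abs(eigvecs[:, np.argmax(eigvals.real)].real)  # Perron eigenvector

ST = ET * eta[None, :]                   # reweight column t by eta_t
ST = ST / ST.sum(axis=1, keepdims=True)  # normalize each row
print(ST)
# rows ~ [[0, 1], [0.382, 0.618]], i.e. ST(s1|s1)/ST(s0|s1) is the golden ratio,
# matching the limiting history ratio from the earlier example.
```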
Now, consider the following example of an AMDP. There are three actions $A := \{a, b, c\}$ and two states $S := \{s_0, s_1\}$. When we apply $a$ to an $s_0$ robot, it creates two $s_0$ robots, whereas when we apply $a$ to an $s_1$ robot, it leaves one $s_1$ robot. When we apply $b$ to an $s_1$ robot, it creates two $s_1$ robots, whereas when we apply $b$ to an $s_0$ robot, it leaves one $s_0$ robot. When we apply $c$ to any robot, it results in one robot whose state is $s_0$ with probability $\frac{1}{2}$ and $s_1$ with probability $\frac{1}{2}$.
Consider the following two policies. $\pi_a$ takes the sequence of actions $cacaca\ldots$ and $\pi_b$ takes the sequence of actions $cbcbcb\ldots$. A population that follows $\pi_a$ would experience the subjective probability $ST(s_0 \mid s_0, c) = \frac{2}{3}$, whereas a population that follows $\pi_b$ would experience the subjective probability $ST(s_0 \mid s_0, c) = \frac{1}{3}$. Hence, subjective probabilities depend on future actions. So, effectively anthropics produces an acausal (Newcomb-like) environment. And, we already know such environments are learnable by infra-Bayesian RL agents and (most probably) not learnable by Bayesian RL agents.
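(One way to check the 2/3 vs. 1/3 numbers is to track expected robot counts through repeated c-then-duplicate cycles; here is a sketch of that bookkeeping, with helper names like apply_dup introduced just for this sketch:)

```python
# Each population entry maps (current state, remembered outcome of the last c)
# to an expected robot count.
from collections import defaultdict

def apply_c(pop):
    """Action c: each robot moves to s0 or s1 with probability 1/2 and remembers it."""
    new = defaultdict(float)
    for (_state, _mem), n in pop.items():
        new[('s0', 's0')] += n / 2
        new[('s1', 's1')] += n / 2
    return new

def apply_dup(pop, doubled_state):
    """Action a doubles s0 robots (doubled_state='s0'); action b doubles s1 robots."""
    new = defaultdict(float)
    for (state, mem), n in pop.items():
        new[(state, mem)] += 2 * n if state == doubled_state else n
    return new

for doubled, name in [('s0', 'pi_a'), ('s1', 'pi_b')]:
    pop = {('s0', None): 1.0}       # start with one s0 robot, no c-outcome remembered
    for _ in range(5):              # five c-then-a (resp. c-then-b) cycles
        pop = apply_dup(apply_c(pop), doubled)
    total = sum(pop.values())
    frac = sum(n for (_s, mem), n in pop.items() if mem == 's0') / total
    print(name, 'subjective probability that c led to s0:', round(frac, 3))
# pi_a gives 0.667 and pi_b gives 0.333, as claimed.
```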
Ah, okay, I see what you mean. Like how preferences are divisible into “selfish” and “worldly” components, where the selfish component is what’s impacted by a future simulation of you that is about to have good things happen to it.
(edit: The reward function in AMDPs can either be analogous to “worldly” and just sum the reward calculated at individual timesteps, or analogous to “selfish” and calculated by taking the limit of the subjective distribution over parts of the history, then applying a reward function to the expected histories.)
I brought up the histories->states thing because I didn’t understand what you were getting at, so I was concerned that something unrealistic was going on. For example, if you assume that the agent can remember its history, how can you possibly handle an environment with memory-wiping?
In fact, to me the example is still somewhat murky, because you’re talking about the subjective probability of a state given a policy and a timestep, but if the agents know their histories there is no actual agent in the information-state that corresponds to having those probabilities. In an MDP the agents just have probabilities over transitions—so maybe a clearer example is an agent that copies itself if it wins the lottery having a larger subjective transition probability of going from gambling to winning. (i.e. states are losing and winning, actions are gamble and copy, the policy is to gamble until you win and then copy).
Ah, okay, I see what you mean. Like how preferences are divisible into “selfish” and “worldly” components, where the selfish component is what’s impacted by a future simulation of you that is about to have good things happen to it.
...I brought up the histories->states thing because I didn’t understand what you were getting at, so I was concerned that something unrealistic was going on. For example, if you assume that the agent can remember its history, how can you possibly handle an environment with memory-wiping?
The AMDP is only a toy model that distills the core difficulty into more or less the simplest non-trivial framework. The rewards are “selfish”: there is a reward function $r : (S \times A)^* \to \mathbb{R}$ which allows assigning utilities to histories by time-discounted summation, and we consider the expected utility of a random robot sampled from a late population. And there is no memory wiping. To describe memory wiping we indeed need to do the “unrolling” you suggested. (Notice that from the cybernetic model POV, the history is only the remembered history.)
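(Spelled out, assuming the natural reading with a discount factor $\gamma$ that is left implicit here: a robot whose remembered history is $h$, with prefixes $h_{:t}$, receives utility
$$U(h) = \sum_{t} \gamma^t\, r(h_{:t}),$$
and the quantity of interest is the expectation of $U(h)$ over a robot sampled uniformly from a late population.)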
For a more complete framework, we can use an ontology chain, but (i) instead of $A \times O$ labels, use $A \times M$ labels, where $M$ is the set of possible memory states (a policy is then described by $\pi : M \to A$), to allow for agents that don’t fully trust their memory, and (ii) consider another chain with a bigger state space $S'$ plus a mapping $p : S' \to \mathbb{N}^S$ s.t. the transition kernels are compatible. Here, the semantics of $p(s)$ is: the multiset of ontological states resulting from interpreting the physical state $s$ by taking the viewpoints of the different agents $s$ contains.
In fact, to me the example is still somewhat murky, because you’re talking about the subjective probability of a state given a policy and a timestep, but if the agents know their histories there is no actual agent in the information-state that corresponds to having those probabilities.
I didn’t understand “no actual agent in the information-state that corresponds to having those probabilities”. What does it mean to have an agent in the information-state?
Nevermind, I think I was just looking at it with the wrong class of reward function in mind.