I recently realized that the formalism of incomplete models provides a rather natural solution to all decision theory problems involving “Omega” (something that predicts the agent’s decisions). An incomplete hypothesis may be thought of as a zero-sum game between the agent and an imaginary opponent (we will call the opponent “Murphy”, as in Murphy’s law). If we assume that the agent cannot randomize against Omega, we need to use the deterministic version of the formalism. That is, an agent that learns an incomplete hypothesis converges to the corresponding maximin value in pure strategies. (The stochastic version can be regarded as a special case of the deterministic version where the agent has access to an external random number generator that is hidden from the rest of the environment, according to the hypothesis.) To every decision problem, we can now associate an incomplete hypothesis as follows. Every time Omega makes a prediction about the agent’s future action in some counterfactual, we have Murphy make a guess instead. This guess cannot be directly observed by the agent. If the relevant counterfactual is realized, then the agent’s action renders the guess false or true. If the guess is false, the agent receives an infinite (or sufficiently large) reward. If the guess is true, everything proceeds as usual. The maximin value then corresponds to the scenario where the guess is true and the agent behaves as if its action controls the guess. (Which is exactly what FDT and its variants try to achieve.)
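In the simplest case of (one-shot) Newcomb’s problem, the construction amounts to the following zero-sum game. This is a minimal Python sketch of my own, with the standard Newcomb payoffs; NIRVANA is an arbitrary large constant standing in for the “infinite (or sufficiently large) reward”.

```python
# Minimal sketch of the Murphy/Nirvana transform for one-shot Newcomb's problem.
NIRVANA = 10**9  # stands in for the "infinite (or sufficiently large) reward"

AGENT_ACTIONS = ["one-box", "two-box"]
MURPHY_GUESSES = ["one-box", "two-box"]  # Murphy guesses the agent's action

def reward(action, guess):
    if action != guess:          # guess rendered false -> Nirvana
        return NIRVANA
    if action == "one-box":      # guess true -> ordinary Newcomb payoff
        return 1_000_000
    return 1_000

# Pure-strategy maximin: the agent maximizes the worst case over Murphy's guesses.
maximin_action = max(
    AGENT_ACTIONS,
    key=lambda a: min(reward(a, g) for g in MURPHY_GUESSES),
)
print(maximin_action)  # -> "one-box": the worst case is always "guess correct"
```

Because the worst case over Murphy’s guesses is always the “guess correct” branch, the maximin agent acts as if its action controls the guess, and one-boxes.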
For example, consider (repeated) counterfactual mugging. The incomplete hypothesis is a partially observable stochastic game (between the agent and Murphy), with the following states:
s0: initial state. Murphy has two actions: g+ (guess the agent will pay), transitioning to s1+ and g− (guess the agent won’t pay) transitioning to s1−. (Reward = 0)
s1+: Murphy guessed the agent will pay. Transitions to s2a+ or s2b+ with probability 1/2 each (the coin flip). (Reward = 0)
s1−: Murphy guessed the agent won’t pay. Transitions to s2a− or s2b− with probability 1/2 each (the coin flip). (Reward = 0)
s2a+: Agent receives the prize. Transitions to s3u. (Reward = +1)
s2b+: Agent is asked for payment. Agent has two actions: p+ (pay) transitioning to s3r+ and p− (don’t pay) transitioning to s3w−. (Reward = 0)
s2a−: Agent receives nothing. Transitions to s3u. (Reward = 0)
s2b−: Agent is asked for payment. Agent has two actions: p+ (pay) transitioning to s3w+ and p− (don’t pay) transitioning to s3r−. (Reward = 0)
s3u: Murphy’s guess remained untested. Transitions to s0. (Reward = 0)
s3r+: Murphy’s guess was right, agent paid. Transitions to s0. (Reward = −0.1)
s3r−: Murphy’s guess was right, agent didn’t pay. Transitions to s0. (Reward = 0)
s3w+: Murphy’s guess was wrong, agent paid. Transitions to s0. (Reward = +1.9)
s3w−: Murphy’s guess was wrong, agent didn’t pay. Transitions to s0. (Reward = +2)
The only percepts the agent receives are (i) the reward and (ii) whether it is asked for payment or not. The agent’s maximin policy is paying, since it guarantees an expected reward of (1/2)·1 + (1/2)·(−0.1) = 0.45 per round.
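To make the arithmetic explicit, here is a minimal Python sketch (my own illustration, not part of the original description) that enumerates the game above and computes the worst-case expected per-round reward of the two pure policies.

```python
# Worst-case expected per-round reward for the policies "pay" and "don't pay",
# against each of Murphy's guesses, in the game described above.
REWARDS = {
    # (murphy_guess, coin_branch, agent_pays) -> reward of the terminal s3 state
    ("g+", "a", None):  +1.0,   # prize received, guess untested        (s3u)
    ("g+", "b", True):  -0.1,   # guess right, agent paid               (s3r+)
    ("g+", "b", False): +2.0,   # guess wrong, agent didn't pay         (s3w-)
    ("g-", "a", None):   0.0,   # nothing received, guess untested      (s3u)
    ("g-", "b", True):  +1.9,   # guess wrong, agent paid               (s3w+)
    ("g-", "b", False):  0.0,   # guess right, agent didn't pay         (s3r-)
}

def expected_reward(agent_pays, murphy_guess):
    total = 0.0
    for coin in ("a", "b"):                          # fair coin: 1/2 each
        pays = agent_pays if coin == "b" else None   # payment only asked on branch b
        total += 0.5 * REWARDS[(murphy_guess, coin, pays)]
    return total

for pays in (True, False):
    worst = min(expected_reward(pays, g) for g in ("g+", "g-"))
    print(f"pay={pays}: worst-case expected reward per round = {worst}")
# pay=True  -> 0.45 (Murphy plays g+), pay=False -> 0.0 (Murphy plays g-)
```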
We can generalize this to an imperfect predictor (a predictor that sometimes makes mistakes) by using the same construction but adding noise to Murphy’s guess for purposes other than the guess’s correctness. Apparently, we can also generalize to the variant where the agent can randomize against Omega and Omega decides based on its predictions of the probabilities. This, however, is more complicated. In this variant there is no binary notion of a “right” or “wrong” guess. Instead, we need to apply some statistical test to the guesses and compare it against a threshold. We can then consider a family of hypotheses with different thresholds, such that (i) with probability 1, for all but some finite number of thresholds, accurate guesses would never be judged wrong by the test, and (ii) with probability 1, consistently inaccurate guesses will be judged wrong by the test, for any threshold.
The same construction applies to logical counterfactual mugging, because the agent cannot distinguish between random and pseudorandom (by definition of pseudorandom). In TRL there would also be some family of programs the agent could execute s.t., according to the hypothesis, their outputs are determined by the same “coin flips” as the offer to pay. However, this doesn’t change the optimal strategy: the “logical time of precommitment” is determined by the computing power of the “core” RL agent, without the computer “envelope”.
My takeaway from this is that if we’re doing policy selection in an environment that contains predictors, instead of applying the counterfactual belief that the predictor is always right, we can assume that we get rewarded if the predictor is wrong, and then take maximin.
How would you handle Agent Simulates Predictor? Is that what TRL is for?
That’s about right. The key point is, “applying the counterfactual belief that the predictor is always right” is not really well-defined (that’s why people have been struggling with TDT/UDT/FDT for so long) while the thing I’m doing is perfectly well-defined. I describe agents that are able to learn which predictors exist in their environment and respond rationally (“rationally” according to the FDT philosophy).
TRL is for many things to do with rational use of computational resources, such as (i) doing multi-level modelling in order to make optimal use of “thinking time” and “interacting with environment time” (i.e. simultaneously optimize sample and computational complexity) (ii) recursive self-improvement (iii) defending from non-Cartesian daemons (iv) preventing thought crimes. But, yes, it also provides a solution to ASP. TRL agents can learn whether it’s better to be predictable or predicting.
“The key point is, “applying the counterfactual belief that the predictor is always right” is not really well-defined”—What do you mean here?
I’m curious whether you’re referring to the same issue as, or one similar to, the one I was referencing in Counterfactuals for Perfect Predictors. The TLDR is that I was worried that it would be inconsistent for an agent that never pays in Parfit’s Hitchhiker to end up in town if the predictor is perfect, so that it wouldn’t actually be well-defined what the predictor was predicting. The way I ended up resolving this was by imagining it as an agent that takes input and asking what it would output if given that inconsistent input. But I’m not sure if you were referencing this kind of concern or something else.
It is not a mere “concern”, it’s the crux of the problem, really. What people in the AI alignment community have been trying to do is start with some factual and “objective” description of the universe (such as a program or a mathematical formula) and derive counterfactuals from it. The way it’s supposed to work is, the agent needs to locate all copies of itself, or things “logically correlated” with itself (whatever that means), in the program, and imagine it is controlling this part. But a rigorous definition of this that solves all standard decision-theoretic scenarios was never found.
Instead of doing that, I suggest a solution of a different nature. In quasi-Bayesian RL, the agent never arrives at a factual and objective description of the universe. Instead, it arrives at a subjective description which already includes counterfactuals. I then proceed to show that, in Newcomb-like scenarios, such agents receive optimal expected utility (i.e. the same expected utility promised by UDT).
Yeah, I agree that the objective descriptions can leave out vital information, such as how the information you know was acquired, which seems important for determining the counterfactuals.
But in Newcomb’s problem, the agent’s reward in case of wrong prediction is already defined. For example, if the agent one-boxes but the predictor predicted two-boxing, the reward should be zero. If you change that to +infinity, aren’t you open to the charge of formalizing the wrong problem?
The point is, if you put this “quasi-Bayesian” agent into an iterated Newcomb-like problem, it will learn to get the maximal reward (i.e. the reward associated with FDT). So, if you’re judging it from the side, you will have to concede it behaves rationally, regardless of its internal representation of reality.
Philosophically, my point of view is that it is an error to think that counterfactuals have objective, observer-independent meaning. Instead, we can talk about some sort of consistency conditions between the different points of view. From the agent’s point of view, it would reach Nirvana if it dodged the predictor. From Omega’s point of view, if Omega predicted two-boxing and the agent one-boxed, the agent’s reward would be zero (and the agent would learn its beliefs were wrong). From a third-person point of view, the counterfactual “Omega makes an error of prediction” is ill-defined; it conditions on an event of probability 0.
Yeah, I think I can make peace with that. Another way to think of it is that we can keep the reward structure of the original Newcomb’s problem, but instead of saying “Omega is almost always right” we add another person Bob (maybe the mad scientist who built Omega) who’s willing to pay you a billion dollars if you prove Omega wrong. Then minimaxing indeed leads to one-boxing. Though I guess the remaining question is why minimaxing is the right thing to do. And if randomizing is allowed, the idea of Omega predicting how you’ll randomize seems a bit dodgy as well.
Another explanation why maximin is a natural decision rule: when we apply maximin to fuzzy beliefs, the requirement to learn a particular class of fuzzy hypotheses is a very general way to formulate asymptotic performance desiderata for RL agents. So general that it seems to cover more or less anything you might want. Indeed, the definition directly leads to capturing any desideratum of the form
$$\lim_{\gamma\to 1}\mathrm{E}_{\mu}^{\pi_{\gamma}}[U(\gamma)]\ \geq\ f(\mu)$$
Here, f doesn’t have to be concave: the concavity condition in the definition of fuzzy beliefs is there because we can always assume it without loss of generality. This is because the left-hand side is linear in μ, so any π that satisfies this inequality for f will also satisfy it for the concave hull of f.
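Spelling out that step (my own sketch, assuming the left-hand side is affine in μ, and writing cav f for the concave hull, i.e. the least concave majorant, of f):

```latex
% g denotes the left-hand side of the desideratum, viewed as a function of \mu.
\begin{align*}
  g(\mu) &:= \lim_{\gamma\to 1}\mathrm{E}_{\mu}^{\pi_{\gamma}}[U(\gamma)]
    && \text{affine (hence concave) in } \mu, \\
  g \ge f &\;\Longrightarrow\; g \ge \operatorname{cav} f
    && \text{since } \operatorname{cav} f \text{ is the least concave function dominating } f.
\end{align*}
```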
What if instead of maximin we want to apply the minimax-regret decision rule? Then the desideratum is
$$\lim_{\gamma\to 1}\mathrm{E}_{\mu}^{\pi_{\gamma}}[U(\gamma)]\ \geq\ V(\mu,\gamma)-f(\mu)$$
But it has the same form! Therefore we can consider it as a special case of applying maximin (more precisely, it requires allowing the fuzzy belief to depend on γ, but this is not a problem for the basics of the formalism).
What if we want our policy to be at least as good as some fixed policy π′0? Then the desideratum is
$$\lim_{\gamma\to 1}\mathrm{E}_{\mu}^{\pi_{\gamma}}[U(\gamma)]\ \geq\ \mathrm{E}_{\mu}^{\pi'_{0}}[U(\gamma)]$$
It still has the same form!
Moreover, the predictor/Nirvana trick allows us to generalize this to desiderata of the form:
$$\lim_{\gamma\to 1}\mathrm{E}_{\mu}^{\pi_{\gamma}}[U(\gamma)]\ \geq\ f(\pi,\mu)$$
To achieve this, we postulate a predictor that guesses the policy, producing the guess $\hat{\pi}$, and define the fuzzy belief using the function $\mathrm{E}_{h\sim\mu}[f(\hat{\pi}(h),\mu)]$ (we assume the guess is not influenced by the agent’s actions, so we don’t need $\pi$ in the expected value). Using the Nirvana trick, we effectively force the guess to be accurate.
In particular, this captures self-referential desiderata of the type “the policy cannot be improved by changing it in this particular way”. These are of the form:
$$\lim_{\gamma\to 1}\mathrm{E}_{\mu}^{\pi_{\gamma}}[U(\gamma)]\ \geq\ \mathrm{E}_{\mu}^{F(\pi)}[U(\gamma)]$$
It also allows us to effectively restrict the policy space (e.g. impose computational resource constraints) by setting f(π,μ) to 1 for policies outside the space.
The fact that quasi-Bayesian RL is so general can also be regarded as a drawback: the more general a framework is, the less information it contains and the fewer useful constraints it imposes. But my perspective is that QBRL is the correct starting point, after which we need to start proving results about which fuzzy hypothesis classes are learnable, and with what sample/computational complexity. So, although QBRL in itself doesn’t impose many restrictions on what the agent should be, it provides the natural language in which desiderata should be formulated. In addition, we can already guess/postulate that an ideal rational agent should be a QBRL agent whose fuzzy prior is universal in some appropriate sense.
Well, I think that maximin is the right thing to do because it leads to reasonable guarantees for quasi-Bayesian reinforcement learning agents. I think of incomplete models as properties that the environment might satisfy. It is necessary to speak of properties instead of complete models since the environment might be too complex to understand in full (for example because it contains Omega, but also for more prosaic reasons), but we can hope it at least has properties/patterns the agent can understand. A quasi-Bayesian agent has the guarantee that, whenever the environment satisfies one of the properties in its prior, the expected utility will converge at least to the maximin for this property. In other words, such an agent is able to exploit any true property of the environment it can understand. Maybe a more “philosophical” defense of maximin is possible, analogous to VNM / complete class theorems, but I don’t know (I actually saw some papers in that vein but haven’t read them in detail.)
If the agent has random bits that Omega doesn’t see, and Omega is predicting the probabilities of the agent’s actions, then I think we can still solve it with quasi-Bayesian agents, but it requires considering more complicated models and I haven’t worked out the details. Specifically, I think that we can define some function X that depends on the agent’s actions and Omega’s predictions so far (a measure of Omega’s apparent inaccuracy), s.t. if Omega is an accurate predictor, then the supremum of X over time is finite with probability 1. Then, we consider a family of models, where model number n says that X < n for all times. Since at least one of these models is true, the agent will learn it, and will converge to behaving appropriately.
EDIT 1: I think X should be something like, how much money would a gambler following a particular strategy win, betting against Omega.
EDIT 2: Here is the solution. In the case of the original Newcomb problem, consider a gambler that bets against Omega on the agent one-boxing. Every time the agent two-boxes, the gambler loses 1 dollar. Every time the agent one-boxes, the gambler wins 1/p − 1 dollars, where p is the probability Omega assigned to one-boxing. Now it’s possible to see that one-boxing guarantees the “CC” payoff under the corresponding model (in the γ→1 limit): if the agent one-boxes, the gambler keeps winning unless Omega converges to predicting one-boxing rapidly enough. In the case of a general Newcomb-like problem, just replace “one-boxes” with “follows the FDT strategy”.
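For concreteness, here is a rough simulation of this gambler (my own illustration; the particular agent policies and Omega predictions passed in below are arbitrary stand-ins, not anything from the discussion):

```python
# Sketch of the gambler from EDIT 2, betting against Omega on the agent one-boxing.
import random

def gambler_wealth(rounds, agent_one_box_prob, omega_prediction):
    """Wealth after betting against Omega each round.

    omega_prediction(t) returns p, Omega's announced probability of one-boxing.
    """
    wealth = 0.0
    for t in range(rounds):
        p = omega_prediction(t)
        if random.random() < agent_one_box_prob:   # agent one-boxes
            wealth += 1.0 / p - 1.0
        else:                                      # agent two-boxes
            wealth -= 1.0
    return wealth

random.seed(0)
# Accurate Omega: its prediction matches the agent's true randomization, so each
# bet has zero expected value and the gambler's expected wealth stays at zero.
print(gambler_wealth(10_000, 0.7, lambda t: 0.7))
# Inaccurate Omega: the agent always one-boxes but Omega keeps predicting 0.5, so
# the gambler wins 1/0.5 - 1 = 1 dollar per round; the running supremum X grows
# without bound, falsifying every model of the form "X < n".
print(gambler_wealth(10_000, 1.0, lambda t: 0.5))
```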
I agree that you can assign whatever belief you want (e.g. whatever is useful for the agent’s decision-making process) for what happens in the counterfactual when Omega is wrong, in decision problems where Omega is assumed to be a perfect predictor. However, if you want to generalise to cases where Omega is an imperfect predictor (as you do mention), then I think you will (in general) have to put in the correct reward for Omega being wrong, because this is something that might actually be observed.
The method should work for imperfect predictors as well. In the simplest case, the agent can model the imperfect predictor as perfect predictor + random noise. So, it definitely knows the correct reward for Omega being wrong. It still believes in Nirvana if “idealized Omega” is wrong.