Planned summary for the Alignment Newsletter:

The central problem of <@embedded agency@>(@Embedded Agents@) is that there is no clean separation between an agent and its environment: rather, the agent is _embedded_ in its environment, and so when reasoning about the environment it is reasoning about an entity that is “bigger” than it (and in particular, an entity that _contains_ it). We don’t have a good formalism that can account for this sort of reasoning. The standard Bayesian account requires the agent to have a space of precise hypotheses for the environment, but then the true hypothesis would also include a precise model of the agent itself, and it is usually not possible to have an agent contain a perfect model of itself.
A natural idea is to reduce the precision of hypotheses. Rather than requiring a hypothesis to assign a probability to every possible sequence of bits, we now allow the hypotheses to say “I have no clue about this aspect of this part of the environment, but I can assign probabilities to the rest of the environment”. The agent can then limit itself to hypotheses that don’t make predictions about the part of the environment that corresponds to the agent, but do make predictions about other parts of the environment.
Another way to think about it is that it allows you to start from the default of “I know nothing about the environment”, and then add in details that you do know to get an object that encodes the easily computable properties of the environment you can exploit, while not making any commitments about the rest of the environment.
Of course, so far this is just the idea of using [Knightian uncertainty](https://en.wikipedia.org/wiki/Knightian_uncertainty). The contribution of infra-Bayesianism is to show how to formally specify a decision procedure that uses Knightian uncertainty, while still satisfying many properties we would like a decision procedure to satisfy. You can thus think of it as an extension of the standard Bayesian account of decision making to the setting in which the agent cannot represent the true environment as a hypothesis over which it can reason.
Imagine that, instead of having a probability distribution over hypotheses, we have two “levels”: first are all the properties we have Knightian uncertainty over, and then are all the properties we can reason about. For example, imagine that the environment is an infinite sequence of bits, and we want to say that all the even bits come from flips of a possibly biased coin, but we know nothing about the odd bits. Then, at the top level, we have a separate branch for each possible setting of the odd bits. At the second level, we have a separate branch for each possible bias of the coin. At the leaves, we have the hypothesis “the odd bits are as set by the top level, and the even bits are generated from coin flips with the bias set by the second level”.
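To make the two levels concrete, here is a small illustrative sketch (mine, not from the infra-Bayesianism posts), restricted to a finite prefix of the sequence so that everything is computable:

```python
# Illustrative only: a leaf hypothesis for a finite prefix of the bit sequence.
# Top level: a setting of the odd bits (Knightian uncertainty, no probabilities).
# Second level: the bias of the coin generating the even bits.

def leaf_probability(bits, odd_setting, bias):
    """P(bits) under the leaf 'odd bits = odd_setting, even bits ~ coin(bias)'."""
    p = 1.0
    for i, bit in enumerate(bits):
        if i % 2 == 1:                      # odd position: fixed by the top level
            if bit != odd_setting[i // 2]:
                return 0.0
        else:                               # even position: flip of the biased coin
            p *= bias if bit == 1 else 1 - bias
    return p

# Under odd bits (1, 0) and bias 0.7, the prefix 1,1,0,0 gets 0.7 * 0.3 = 0.21.
print(leaf_probability((1, 1, 0, 0), odd_setting=(1, 0), bias=0.7))
```

The full hypothesis space is then indexed by (odd_setting, bias) pairs; the point of the construction is that we refuse to put probabilities on the first coordinate.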
(Yes, there are lots of infinite quantities in this example, so you couldn’t implement it the way I’m describing it here. An actual implementation would not represent the top level explicitly and would use computable functions to represent the bottom level. We’re not going to worry about this for now.)
If we were using orthodox Bayesianism, we would put a probability distribution over the top level, and a probability distribution over the bottom level. You could then multiply that out to get a single probability distribution over the hypotheses, which is why we don’t do this separation into two levels in orthodox Bayesianism. (Also, just to reiterate, the _whole point_ is that we can’t put a probability distribution at the top level, since that implies e.g. making precise predictions about an environment that is bigger than you are.)
Infra-Bayesianism says, “what if we just… not put a probability distribution over the top level?” Instead, we have a set of probability distributions over hypotheses, and Knightian uncertainty over which distribution in this set is the right one. A common suggestion for Knightian uncertainty is to do _worst-case_ reasoning, so that’s what we’ll do at the top level. Lots of problems immediately crop up, but it turns out we can fix them.
First, let’s say your top level consists of two distributions over hypotheses, A and B. You then observe some evidence E, which A thought was 50% likely and B thought was 1% likely. Intuitively, you want to say that this makes A “more likely” relative to B than we previously thought. But how can you do this if you have Knightian uncertainty and are just planning to do worst-case reasoning over A and B? The solution here is to work with _unnormalized_ probability distributions at the second level. Then, in the case above, we can just scale the “probabilities” in both A and B by the likelihood assigned to E. We _don’t_ normalize A and B after doing this scaling.
But now what exactly do the numbers mean, if we’re going to leave these distributions unnormalized? Regular probabilities only really make sense if they sum to 1. We can take a different view on what a “probability distribution” is—instead of treating it as an object that tells you how _likely_ various hypotheses are, treat it as an object that tells you how much we _care_ about particular hypotheses. (See [related](https://www.lesswrong.com/posts/J7Gkz8aDxxSEQKXTN/what-are-probabilities-anyway) <@posts@>(@An Orthodox Case Against Utility Functions@).) So scaling down the “probability” of a hypothesis just means that we care less about what that hypothesis “wants” us to do.
This would be enough if we were going to take an average over A and B to make our final decision. However, our plan is to do worst-case reasoning at the top level. This interacts horribly with our current proposal: when we scale hypotheses in A by 0.5 on average, and hypotheses in B by 0.01 on average, the minimization at the top level is going to place _more_ weight on B, since B is now _more_ likely to be the worst case. Surely this is wrong?
What’s happening here is that B gets most of its expected utility in worlds where we observe different evidence, but the worst-case reasoning at the top level doesn’t take this into account. Before we update, since B assigned 1% to E, the expected utility of B is given by 0.99 * expected utility given not-E + 0.01 * expected utility given E. After the update, the second part remains but the first part disappears, which makes the worst-case reasoning wonky. So what we do is we keep track of the first part as well, and make sure that our worst-case reasoning takes it into account.
This gives us **infradistributions**: sets of (m, b) pairs, where m is an unnormalized probability distribution and b corresponds to “the value we would have gotten if we had seen different evidence”. When we observe some evidence E, the hypotheses within m are scaled by the likelihood they assign to E, and b is updated to include the value we would have gotten in the world where we saw anything other than E. Note that it is important to specify the utility function for this to make sense, as otherwise it is not clear how to update b. To compute utilities for decision-making, we do worst-case reasoning over the (m, b) pairs, where we use standard expected values within each m. We can prove that this update rule satisfies _dynamic consistency_: if initially you believe “if I see X, then I want to do Y”, then after seeing X, you believe “I want to do Y”.
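As a concrete (and heavily simplified) illustration, here is a toy sketch of this update and evaluation for a finite outcome space. The representation and names are mine, not the formalism from the posts, and I’m assuming the counterfactual value “what we would have gotten had we not seen E” is simply handed to us:

```python
# Toy infradistribution: a list of (m, b) pairs, where m maps each outcome to an
# unnormalized mass and b is the "value we would have gotten on other evidence"
# term described above.

def update(infra, likelihood, value_if_not_E):
    """Update on evidence E.

    likelihood[o]: probability of seeing E under outcome o.
    value_if_not_E[o]: utility we would have obtained under o had we not seen E
    (this is why the utility function is needed to define the update).
    """
    updated = []
    for m, b in infra:
        # Scale masses by the likelihood of E -- and do NOT renormalize.
        new_m = {o: p * likelihood[o] for o, p in m.items()}
        # Fold the off-E value into b so worst-case reasoning still sees it.
        new_b = b + sum(p * (1 - likelihood[o]) * value_if_not_E[o]
                        for o, p in m.items())
        updated.append((new_m, new_b))
    return updated

def value(infra, utility):
    """Worst case over (m, b) pairs of (expected utility under m) plus b."""
    return min(sum(p * utility[o] for o, p in m.items()) + b
               for m, b in infra)
```

The `new_m` line is the scaling-without-normalizing step from a couple of paragraphs up, and the `new_b` line is the bookkeeping that keeps the worst case over A and B sensible after the update.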
So what can we do with infradistributions? Our original motivation was to talk about embedded agency, so a natural place to start is with decision theory problems in which the environment contains a perfect predictor of the agent, such as in Newcomb’s problem. Unfortunately, we can’t immediately write this down with infradistributions, because we have no way of (easily) formally representing “the environment perfectly predicts my actions”. One trick we can use is to consider hypotheses in which the environment just spits out some action, without the constraint that it must match the agent’s action. We then modify the utility function to give infinite utility when the prediction is incorrect. Since we do worst-case reasoning, the agent will effectively act as though this situation is impossible. With this trick, infra-Bayesianism performs similarly to UDT on a variety of challenging decision problems.
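A toy numerical version of that trick (my own illustration, not code from the post): each hypothesis fixes the predictor’s output, a mismatch with the agent’s action is assigned +infinity, and worst-case reasoning then recommends one-boxing.

```python
# Nirvana trick on Newcomb's problem, with the predictor's output as the
# Knightian top level.
INF = float("inf")

def payoff(action, prediction):
    if action != prediction:
        return INF                       # "Nirvana": ignored by the minimum below
    return 1_000_000 if action == "one-box" else 1_000

def worst_case(action, predictions=("one-box", "two-box")):
    return min(payoff(action, pred) for pred in predictions)

print(worst_case("one-box"))                          # 1000000
print(worst_case("two-box"))                          # 1000
print(max(["one-box", "two-box"], key=worst_case))    # one-box
```

The comments below also discuss the mirror-image “anti-Nirvana” variant, where mispredictions get minus infinity and the top level does best-case reasoning instead; that combination produces the same behavior.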
Planned opinion:
This seems pretty cool, though I don’t understand it that well yet. While I don’t yet feel like I have a better philosophical understanding of embedded agency (or its subproblems), I do think this is significant progress along that path.
In particular, one thing that feels a bit odd to me is the choice of worst-case reasoning for the top level—I don’t really see anything that _forces_ that to be the case. As far as I can tell we could get all the same results by using best-case reasoning instead (assuming we modified the other aspects appropriately). The obvious justification for worst-case reasoning is that it is a form of risk aversion, but it doesn’t feel like that is really sufficient—risk aversion in humans is pretty different from literal worst-case reasoning, and also none of the results in the post seem to depend on risk aversion.
I wonder whether the important thing is just that we don’t do expected value reasoning at the top level, and there are in fact a wide variety of other kinds of decision rules that we could use that could all work. If so, it seems interesting to characterize what makes some rules work while others don’t. I suspect that would be a more philosophically satisfying answer to “how should agents reason about environments that are bigger than them”.
The central problem of <@embedded agency@>(@Embedded Agents@) is that there is no clean separation between an agent and its environment...
That’s certainly one way to motivate IB, however I’d like to note that even if there was a clean separation between an agent and its environment, it could still be the case that the environment cannot be precisely modeled by the agent due to its computational complexity (in particular this must be the case if the environment contains other agents of similar or greater complexity).
The contribution of infra-Bayesianism is to show how to formally specify a decision procedure that uses Knightian uncertainty, while still satisfying many properties we would like a decision procedure to satisfy.
Well, the use of Knightian uncertainty (imprecise probability) in decision theory has certainly appeared in the literature, so it would be more fair to say that the contribution of IB is combining that with reinforcement learning theory (i.e. treating sequential decision making and considering learnability and regret bounds in this setting) and applying that to various other questions (in particular, Newcombian paradoxes).
In particular, one thing that feels a bit odd to me is the choice of worst-case reasoning for the top level—I don’t really see anything that forces that to be the case. As far as I can tell we could get all the same results by using best-case reasoning instead (assuming we modified the other aspects appropriately).
The reason we use worst-case reasoning is because we want the agent to satisfy certain guarantees. Given a learnable class of infra-hypotheses, in the γ→1 limit, we can guarantee that whenever the true environment satisfies one of those hypotheses, the agent attains at least the corresponding amount of expected utility. You don’t get anything analogous with best-case reasoning.
Moreover, there is an (unpublished) theorem showing that virtually any guarantee you might want to impose can be written in IB form. That is, let E be the space of environments, and let g_n: E → [0,1] be an increasing sequence of functions. We can interpret every g_n as a requirement about the policy π: ∀μ: E_μ^π[U] ≥ g_n(μ). These requirements become stronger with increasing n. We might then want π to be s.t. it satisfies the requirement with the highest n possible. The theorem then says that (under some mild assumptions about the functions g_n) there exists an infra-environment s.t. optimizing for it is equivalent to maximizing n. (We can replace n by a continuous parameter; I made it discrete just for ease of exposition.)
The obvious justification for worst-case reasoning is that it is a form of risk aversion, but it doesn’t feel like that is really sufficient—risk aversion in humans is pretty different from literal worst-case reasoning, and also none of the results in the post seem to depend on risk aversion.
Actually it might not be that different. The Legendre-Fenchel duality shows you can think of infradistributions as just concave expectation functionals, which seems like a fairly general way to add risk-aversion to decision theory. It is also used in mathematical economics; see Peng.
it seems interesting to characterize what makes some rules work while others don’t.
Another rule which is tempting to use (and is known in the literature) is minimax-regret. However, it’s possible to show that if you allow your hypotheses to depend on the utility function then you can reduce it to ordinary maximin.
I’d like to note that even if there was a clean separation between an agent and its environment, it could still be the case that the environment cannot be precisely modeled by the agent due to its computational complexity
Yeah, agreed. I’m intentionally going for a simplified summary that sacrifices details like this for the sake of cleaner narrative.
it would be more fair to say that the contribution of IB is combining that with reinforcement learning theory
Ah, whoops. Live and learn.
The reason we use worst-case reasoning is because we want the agent to satisfy certain guarantees. Given a learnable class of infra-hypotheses, in the γ→1 limit, we can guarantee that whenever the true environment satisfies one of those hypotheses, the agent attains at least the corresponding amount of expected utility. You don’t get anything analogous with best-case reasoning.
Okay, that part makes sense. Am I right though that in the case of e.g. Newcomb’s problem, if you use the anti-Nirvana trick (getting -infinity reward if the prediction is wrong), then you would still recover the same behavior (EDIT: if you also use best-case reasoning instead of worst-case reasoning)? (I think I was a bit too focused on the specific UDT / Nirvana trick ideas.)
Actually it might not be that different. The Legendre-Fenchel duality shows you can think of infradistributions as just concave expectation functionals, which seems like a fairly general way to add risk-aversion to decision theory.
Yeah… I’m a bit confused about this. If you imagine choosing any concave expectation functional, then I agree that can model basically any type of risk aversion. But it feels like your infra-distribution should “reflect reality” or something along those lines, which is an extra constraint. If there’s a “reflect reality” constraint and a “risk aversion” constraint and these are completely orthogonal, then it seems like you can’t necessarily satisfy both constraints at the same time.
On the other hand, maybe if I thought about it for longer, I’d realize that the things we think of as “risk aversion” are actually identical to the “reflect reality” constraint when we are allowed to have Knightian uncertainty over some properties of the environment. In that case I would no longer have my objection.
To be a bit more concrete: imagine that you know that the even bits in an infinite bitsequence come from a fair coin, but the odd bits come from some other agent, where you can’t model them exactly but you have some suspicion that they are a bit more likely to choose 1 over 0. Risk aversion might involve making a small bet that you’d see a 1 rather than a 0 in some specific odd bit (smaller than what EU maximization / Bayesian decision theory would recommend), but “reflecting reality” might recommend having Knightian uncertainty about the output of the agent which would mean never making a bet on the outputs of the odd bits.
I am curious what happens in this scenario if you set the concave expectation functional based on the “risk aversion” setting above, and then use duality to get the “convex set of distributions” formulation—would the resulting object be meaningful to us?
Am I right though that in the case of e.g. Newcomb’s problem, if you use the anti-Nirvana trick (getting -infinity reward if the prediction is wrong), then you would still recover the same behavior (EDIT: if you also use best-case reasoning instead of worst-case reasoning)?
Yes
imagine that you know that the even bits in an infinite bitsequence come from a fair coin, but the odd bits come from some other agent, where you can’t model them exactly but you have some suspicion that they are a bit more likely to choose 1 over 0. Risk aversion might involve making a small bet that you’d see a 1 rather than a 0 in some specific odd bit (smaller than what EU maximization / Bayesian decision theory would recommend), but “reflecting reality” might recommend having Knightian uncertainty about the output of the agent which would mean never making a bet on the outputs of the odd bits.
I think that if you are offered a single bet, your utility is linear in money, and your belief is a crisp infradistribution (i.e. a closed convex set of probability distributions), then it is always optimal to bet either as much as you can or nothing at all. But for more general infradistributions this need not be the case. For example, consider X := {0,1} and take the set of a-measures generated by 3δ₀ and δ₁. Suppose you start with 1/2 dollars and can bet any amount on any outcome at even odds. Then the optimal bet is betting 1/4 dollars on the outcome 1, with a value of 3/4 dollars.
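For readers who want to check the arithmetic, here is a quick sketch (mine); the worst case over the generated convex set of a-measures is attained at one of the two generators, since expectations are linear in the measure:

```python
# Bet x on outcome 1 at even odds, starting from 1/2 dollars:
# wealth is 1/2 + x if the outcome is 1, and 1/2 - x if it is 0.
import numpy as np

bets = np.linspace(0, 0.5, 501)
value_under_3delta0 = 3 * (0.5 - bets)   # expected wealth under the a-measure 3*delta_0
value_under_delta1 = 0.5 + bets          # expected wealth under delta_1
worst = np.minimum(value_under_3delta0, value_under_delta1)
print(bets[np.argmax(worst)], worst.max())   # -> 0.25 0.75
```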
But for more general infradistributions this need not be the case. For example, consider X := {0,1} and take the set of a-measures generated by 3δ₀ and δ₁. Suppose you start with 1/2 dollars and can bet any amount on any outcome at even odds. Then the optimal bet is betting 1/4 dollars on the outcome 1, with a value of 3/4 dollars.
I guess my question is more like: shouldn’t there be some aspect of reality that determines what my set of a-measures is? It feels like here we’re finding a set of a-measures that rationalizes my behavior, as opposed to choosing a set of a-measures based on the “facts” of the situation and then seeing what behavior that implies.
I feel like we agree on what the technical math says, and I’m confused about the philosophical implications. Maybe we should just leave the philosophy alone for a while.
IIUC your question can be reformulated as follows: a crisp infradistribution can be regarded as a claim about reality (the true distribution is inside the set), but it’s not clear how to generalize this to non-crisp. Well, if you think in terms of desiderata, then crisp says: if the distribution is inside the set then we have some lower bound on expected utility (and if it’s not then we don’t promise anything). On the other hand, non-crisp gives a lower bound that varies with the true distribution. We can think of non-crisp infradistributions as fuzzy properties of the distribution (hence the name “crisp”). In fact, if we restrict ourselves to homogenous, cohomogenous or c-additive infradistributions, then we actually have a formal way to assign membership functions to infradistributions, i.e. literally regard them as fuzzy sets of distributions (which of course have to satisfy some property analogous to convexity).
If you use the Anti-Nirvana trick, your agent just goes “nothing matters at all, the foe will mispredict and I’ll get -infinity reward” and rolls over and cries since all policies are optimal. Don’t do that one, it’s a bad idea.
For the concave expectation functionals: Well, there’s another constraint or two, like monotonicity, but yeah, LF duality basically says that you can turn any (monotone) concave expectation functional into an inframeasure. I.e., all risk aversion can be interpreted as having radical uncertainty over some aspects of how the environment works and assuming you get worst-case outcomes from the parts you can’t predict.
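Roughly, and omitting the exact regularity and normalization conditions (this is my paraphrase, not a statement from the posts), the duality says

```latex
% Monotone concave expectation functionals <-> sets of a-measures
% (my paraphrase; continuity/normalization conditions elided):
E_K[f] = \min_{(m,\,b) \in K} \left( \int f \, dm + b \right)
```

so choosing a risk-averse (concave) way of scoring payoff functions f is the same as choosing a set K of (measure, constant) pairs and taking the worst case over it.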
For your concrete example, that’s why you have multiple hypotheses that are learnable. Sure, one of your hypotheses might have complete Knightian uncertainty over the odd bits, but another hypothesis might not. Betting on the odd bits is advised by a more-informative hypothesis, for sufficiently good bets. And the policy selected by the agent would probably be something like “bet on the odd bits occasionally, and if I keep losing those bets, stop betting”, as this wins in the hypothesis where some of the odd bits are predictable, and doesn’t lose too much in the hypothesis where the odd bits are completely unpredictable and out to make you lose.
If you use the Anti-Nirvana trick, your agent just goes “nothing matters at all, the foe will mispredict and I’ll get -infinity reward” and rolls over and cries since all policies are optimal. Don’t do that one, it’s a bad idea.
Sorry, I meant the combination of best-case reasoning (sup instead of inf) and the anti-Nirvana trick. In that case the agent goes “Murphy won’t mispredict, since then I’d get -infinity reward which can’t be the best that I do”.
For your concrete example, that’s why you have multiple hypotheses that are learnable.
Hmm, that makes sense, I think? Perhaps I just haven’t really internalized the learning aspect of all of this.
One thing I realized after the podcast is that because the decision theory you get can only handle pseudo-causal environments, it’s basically trying to think about the statistics of environments rather than their internals. So my guess is that further progress on transparent Newcomb is going to have to look like adding in the right kind of logical uncertainty or something. But basically it unsurprisingly has more of a statistical nature than what you imagine you want when reading the FDT paper.
it’s basically trying to think about the statistics of environments rather than their internals
That’s not really true because the structure of infra-environments reflects the structure of those Newcombian scenarios. This means that the sample complexity of learning them will likely scale with their intrinsic complexity (e.g. some analogue of RVO dimension). This is different from treating the environment as a black-box and converging to optimal behavior by pure trial and error, which would yield much worse sample complexity.
I agree that infra-Bayesianism isn’t just thinking about sampling properties, and maybe ‘statistics’ is a bad word for that. But the failure on transparent Newcomb without kind-of-hacky changes suggests to me a focus on “what actions look good throughout the probability distribution” rather than on “what logically-causes this program to succeed”.
There is some truth in that, in the sense that your beliefs must take a form that is learnable rather than just a god-given system of logical relationships.
There’s actually an upcoming post going into more detail on what the deal is with pseudocausal and acausal belief functions, among several other things; I can send you a draft if you want. “Belief Functions and Decision Theory” is a post that hasn’t held up nearly as well to time as “Basic Inframeasure Theory”.
Thanks for the offer, but I don’t think I have room for that right now.