It seems like you’re not engaging with the core argument, though—sure, we can understand everything as Bayesian updates, but why must we do so? Surely you agree that a classical, non-straw Bayesian understands Bayesian updates to be normative. The central argument here is just: why? The argument in favor of Bayesian updates is insufficient.
Why is a dogmatic Bayesian not allowed to update on virtual evidence?
Because P(A|B) is understood by the ratio formula, P(A&B)/P(B), and the terms P(A&B) and P(B) are understood in terms of the Kolmogorov probability axioms, which means that B is required to be an event (that is, a pre-existing event in the sigma algebra which underlies the probability distribution).
Straw man or no, the assumption that P(A|B) is to be understood by the ratio formula is incredibly common, and the assumption that B is to be understood as an event in a sigma algebra is incredibly common. Virtual evidence has to throw out at least one of the two; the most straightforward way of doing it throws out both.
Even if these were just common oversimplifications in textbooks, it would be worth pointing out. But I don’t think that’s the case—I think lots of Bayesians would assert these things. Even if they’ve read Pearl at some point. (I think classical Bayesians read Pearl’s section on virtual evidence and see it as a neat technical trick, but not a reason to really abandon fundamental assumptions about how updates work.)
It seems like you (and Jeffrey?) have overly constrained the types of observations that a classical Bayesian is allowed to use, to essentially sensory stimuli.
I neglected to copy the thing where Jeffrey quotes prominent Bayesians to establish that they really believe these things, before ripping them to shreds.
It seems to me, talking to people, that the common Bayesian world-view includes stuff like: “any uncertain update like Jeffrey is talking about has to eventually ground out in evidence which is viewed with 100% certainty”; “formally, we have to adjust for observation bias by updating on the fact that we observed X rather than X alone; but ideal Bayesian agents never have to deal with this problem at the low level, because an agent is only ever updating on its observations—and an observation X is precisely the sort of thing which is equal to the-observation-of-X”; “what would you update on, if not your observations?”
This stuff isn’t necessarily explicitly written down or explicitly listed as part of the world-view, but (in my limited experience) it comes out when talking about tricky issues around the boundary. (And also there are some historical examples of Bayesians writing down such thoughts.)
The formal mathematics seems a little conflicted on this, since formal probability theory gives one the impression that one can update on any proposition, but information theory and some other important devices assume a stream of observations with a more restricted form.
I find your comment interesting because you seem to be suggesting a less radical interpretation of all this:
Radical Probabilism throws out dogmatism of perception, rigidity, and empiricism (but can keep the ratio formula and the Kolmogorov axioms—although I think Jeffrey would throw out both of those too). This opens up the space of possible updates to “almost anything”. I then offer virtual evidence as a reformulation which makes things feel more familiar, without really changing the fundamental conclusion that Bayesian updates are being abandoned.
Your steelman Bayesianism instead takes the formalism of virtual evidence seriously, and so abandons the ratio formula and the Kolmogorov axioms, but keeps a version of dogmatism of perception, rigidity, and empiricism—with the consequence of keeping Bayesian updates as the fundamental update rule.
I’m fine with either of these approaches. The second seems more practical for the working Bayesian. The first seems more fundamental/foundational, since it rests simply on scratching out unjustified rationality assumptions (i.e., Bayesian updates).
It seems like you are attacking a strawman, given that by your definition, Pearl isn’t a classical Bayesian.
I think Pearl has thought deeply about these things largely because Pearl has read Jeffrey.
Essentially, we update on A* instead of A. When we compute the likelihood P(A*|X), we can attempt to account for all the problems with our senses, neurons, memory, etc. that result in P(A*|~A) > 0.
The practical problem with this, in contrast to a more radical-probabilism approach, is that the probability distribution then has to explicitly model all of that stuff. As Scott and I discussed in Embedded World-Models, classical Bayesian models require the world to be in the hypothesis space (AKA realizability AKA grain of truth) in order to have good learning guarantees; so, in a sense, they require that the world is smaller than the probability distribution. Radical probabilism does not rest on this assumption for good learning properties.
But I don’t find this compelling. At some point there is a boundary to the machinery that’s performing the Bayesian update itself. If the message is being degraded after this point, then that means we’re no longer talking about a Bayesian updater.
I’m not sure what Jeffrey would say to this. I like his idea of doing away with this boundary. But I have not seen him follow through by actually providing a way to do without drawing such a boundary.
I definitely missed a few things on the first read through—thanks for repeating the ratio argument in your response.
I’m still confused about this statement:
Virtual evidence requires probability functions to take arguments which aren’t part of the event space.
Why can’t virtual evidence messages be part of the event space? Is it because they are continuously valued?
As to why one would want to have Bayesian updates be normative: one answer is that they maximize our predictive power, given sufficient compute. Given the name of this website, that seems a sufficient reason.
A second answer you hint at here:
The second seems more practical for the working Bayesian.
As a working Bayesian myself, having a practical update rule is quite useful! As far as I can tell, I don’t see a good alternative in what you have provided.
Then we have to ask why not (steelmanned) classical Bayesianism? I think you’ve two arguments, one of which I buy, the other I don’t.
The practical problem with this, in contrast to a more radical-probabilism approach, is that the probability distribution then has to explicitly model all of that stuff.
This is the weak argument. Computing P(A*|X) “the likelihood I recall seeing A given X” is not a fundamentally different thing than modeling P(A|X) “the likelihood signal A happened given X”. You have to model an extra channel effect or two, but that’s just a difference of degree.
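Concretely, a quick sketch (assuming the corruption of the signal depends only on whether A actually occurred): P(A*|X) = P(A*|A)P(A|X) + P(A*|~A)P(~A|X). The only new ingredients are the two channel terms P(A*|A) and P(A*|~A).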
Immediately after, though, you have the better argument:
As Scott and I discussed in Embedded World-Models, classical Bayesian models require the world to be in the hypothesis space (AKA realizability AKA grain of truth) in order to have good learning guarantees; so, in a sense, they require that the world is smaller than the probability distribution. Radical probabilism does not rest on this assumption for good learning properties.
If I were to paraphrase: classical Bayesianism can fail entirely when the world state does not fit into one of its nonzero-probability hypotheses, which must of necessity be limited in any realizable implementation.
I find this pretty convincing. In my experience this is a problem that crops up quite frequently, and it requires the meta-Bayesian methods you mentioned, like calibration (to notice you are confused) and generation of novel hypotheses.
(Although Bayesianism is not completely dead here. If you reformulate your estimation problem to be over the hypothesis space and model space jointly, then Bayesian updates can get you the sort of probability shifts discussed in Pascal’s Muggle. Of course, you still run into the ‘limited compute’ problem, but in many cases it might be easier than attempting to cover the entire hypothesis space. Probably worth a whole other post by itself.)
Why can’t virtual evidence messages be part of the event space? Is it because they are continuously valued?
Ah, now there’s a good question.
You’re right, you could have an event in the event space which is just “the virtual-evidence update [such-and-such]”. I’m actually going to pull out this trick in a future follow-up post.
I note that that’s not how Pearl or Jeffrey understand these updates. And it’s a peculiar thing to do—something happens to make you update a particular amount, but you’re just representing the event by the amount you update. Virtual evidence as-usually-understood at least coins a new symbol to represent the hard-to-articulate thing you’re updating on.
But that’s not much of an argument against the idea, especially when virtual evidence is already a peculiar practice.
Note that re-framing virtual evidence updates as Bayesian updates doesn’t change the fact that you’re now allowing essentially arbitrary updates rather than using Bayes’ Law to compute your update. You’re adjusting the formalism around an update to make Bayes’ Law true of it, not using Bayes’ Law to compute the update. So Bayes’ Law is still de-throned in the sense that it’s no longer the formula used to compute updates.
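To see how little work Bayes’ Law is doing here, consider a minimal sketch (toy numbers of my own): pick any posterior you like, and you can back-solve a virtual-evidence likelihood ratio that makes the jump look like a Bayesian update after the fact.

    # toy numbers; the point is only that the "likelihood ratio" is
    # reverse-engineered from the update, not the other way around
    prior = 0.30    # P(A) before the update
    target = 0.80   # the posterior we somehow arrived at

    prior_odds = prior / (1 - prior)
    target_odds = target / (1 - target)

    # back-solve the virtual-evidence likelihood ratio P(E*|A) : P(E*|~A)
    ratio = target_odds / prior_odds

    # applying Bayes' law with that ratio reproduces the target exactly,
    # but Bayes' law did no work in choosing the update
    recovered = prior * ratio / (prior * ratio + (1 - prior))
    assert abs(recovered - target) < 1e-9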
As to why one would want to have Bayesian updates be normative: one answer is that they maximize our predictive power, given sufficient compute. Given the name of this website, that seems a sufficient reason.
I’m going to focus on this even though you more or less concede the point later on in your reply. (By the way, I really appreciate your in-depth engagement with my position.)
In what sense? What technical claim about Bayesian updates are you trying to refer to?
One candidate is the dominance of Solomonoff induction over any computable machine learning technique. This has several problems:
Solomonoff induction only gets you this guarantee in the sequential prediction setting, whereas logical induction gets a qualitatively similar guarantee in a broader setting.
The “sufficient compute” needed here is infinite, whereas the “sufficient compute” needed for logical induction is finite. So logical induction would seem to get closer to offering real advice for bounded agents, rather than idealized decision theory whose application to bounded agents requires further insights.
Thus, logical induction is more suited to handling logical uncertainty, which is critical for handling bounded agents. Logical uncertainty becomes less confusing when non-Bayesian updates are allowed.
Solomonoff induction can display irrational behaviors such as miscalibration if grain-of-truth does not apply. These failures of rationality can be independently concerning, even given a nice guarantee about Bayes loss.
Note that I did not put the point “Solomonoff’s dominance property requires grain-of-truth” in this list, because it doesn’t—in a Bayes Loss sense, Solomonoff still does almost as well as any computable machine learning technique even in non-realizable settings where it may not converge to any one computable hypothesis. (Even though better calibration does tend to improve Bayes loss, we know it isn’t losing too much to this.)
Another candidate is that Bayes’ law is the optimal update policy when starting with a particular prior and trying to minimize Bayes loss. But again this has several problems.
This is only true if the only information we have coming in is a sequence of propositions which we are updating 100% on. Realistically, we can revise the prior to be better just by thinking about it (due to our logical uncertainty), making non-Bayesian updates relevant.
This optimality property only makes sense if we believe something like grain-of-truth. “I’ve listed all the possibilities in my prior, and I’m hedging against them according to my degree of belief in them. What more do you want from me?”
Rationality isn’t just about minimizing Bayes loss. Bayes loss is a compelling property in part because it gets us other intuitively compelling properties (such as Bayes’ Law). But properties such as calibration and convergence also have intuitive appeal, and we can back this appeal up with Dutch Book and Dutch-Book-like arguments.
A second answer you hint at here:
The second seems more practical for the working Bayesian.
As a working Bayesian myself, having a practical update rule is quite useful! As far as I can tell, I don’t see a good alternative in what you have provided.
Sadly, the actual machinery of logical induction was beyond the scope of this post, but there are answers. I just don’t yet know a good way to present it all as a nice, practical, intuitively appealing package.
You’re right, you could have an event in the event space which is just “the virtual-evidence update [such-and-such]”. I’m actually going to pull out this trick in a future follow-up post.
I note that that’s not how Pearl or Jeffrey understand these updates. And it’s a peculiar thing to do—something happens to make you update a particular amount, but you’re just representing the event by the amount you update. Virtual evidence as-usually-understood at least coins a new symbol to represent the hard-to-articulate thing you’re updating on.
That’s not quite what I had in mind, but I can see how my ‘continuously valued’ comment might have thrown you off. A more concrete example might help: consider Example 2 in this paper. It posits three events:
b—my house was burgled
a—my alarm went off
z—my neighbor calls to tell me the alarm went off
Pearl’s method is to take what would be uncertain information about a (via my model of my neighbor and the fact she called me) and transform it into virtual evidence (which includes the likelihood ratio). What I’m saying is that you can just treat z as being an event itself, and do a Bayesian update from the likelihood P(z|b)=P(z|a)P(a|b)+P(z|~a)P(~a|b), etc. This will give you the exact same posterior as Pearl. Really, the only difference in these formulations is that Pearl only needs to know the ratio P(z|a):P(z|~a), whereas traditional Bayesian update requires actual values. Of course, any set of values consistent with the ratio will produce the right answer.
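A minimal numerical sketch of that equivalence (the numbers here are placeholders I made up, not Pearl’s; only the structure matters):

    # priors and conditionals for the burglary/alarm/phone-call story
    P_b = 0.001      # P(burglary)
    P_a_b = 0.95     # P(alarm | burglary)
    P_a_nb = 0.01    # P(alarm | no burglary)
    P_z_a = 0.80     # P(neighbor calls | alarm)
    P_z_na = 0.001   # P(neighbor calls | no alarm)

    # Route 1: treat z as an ordinary event and update on it directly.
    P_z_b = P_z_a * P_a_b + P_z_na * (1 - P_a_b)      # P(z | b)
    P_z_nb = P_z_a * P_a_nb + P_z_na * (1 - P_a_nb)   # P(z | ~b)
    post_direct = P_z_b * P_b / (P_z_b * P_b + P_z_nb * (1 - P_b))

    # Route 2: Pearl-style virtual evidence, using only the ratio
    # P(z|a) : P(z|~a) and propagating it through a.
    ratio = P_z_a / P_z_na
    L_b = ratio * P_a_b + (1 - P_a_b)      # proportional to P(z | b)
    L_nb = ratio * P_a_nb + (1 - P_a_nb)   # proportional to P(z | ~b)
    post_virtual = L_b * P_b / (L_b * P_b + L_nb * (1 - P_b))

    assert abs(post_direct - post_virtual) < 1e-9

With these made-up numbers both routes give a posterior of about 0.078; the virtual-evidence route just never needed the absolute values of P(z|a) and P(z|~a), only their ratio.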
The slightly more complex case (and why I mentioned continuous values) is in section 5, where the message includes probability data, such as a likelihood ratio. Note that the continuous value is not the amount you update (at least not generally), because it’s not generated from your own models, but rather by the messenger. Consider event z99, where my neighbor calls to say she’s 99% sure the alarm went off. This doesn’t mean I have to treat P(z99|a):P(z99|~a) as 99:1; I might model my neighbor as being poorly calibrated (or as not being independent of other information I already have), and use some other ratio.
In what sense? What technical claim about Bayesian updates are you trying to refer to?
Definitely the second one, as optimal update policy. Responding to your specific objections:
This is only true if the only information we have coming in is a sequence of propositions which we are updating 100% on.
As you’ll hopefully agree at this point, we can always manufacture the 100% condition by turning it into virtual evidence.
This optimality property only makes sense if we believe something like grain-of-truth.
I believe I previously conceded this point—the true hypothesis (or at least a ‘good enough’ one) must have a nonzero probability, which we can’t guarantee.
But properties such as calibration and convergence also have intuitive appeal
Re: calibration—I still believe that this can be included if you are jointly estimating your model and your hypothesis.
Re: convergence—how real a problem is this? In your example you had two hypotheses that were precisely equally wrong. Does convergence still fail if the true probability is 0.500001?
(By the way, I really appreciate your in-depth engagement with my position.)
Likewise! This has certainly been educational, especially in light of this:
Sadly, the actual machinery of logical induction was beyond the scope of this post, but there are answers. I just don’t yet know a good way to present it all as a nice, practical, intuitively appealing package.
The solution is too large to fit in the margins, eh? j/k, I know there’s a real paper. Should I go break my brain trying to read it, or wait for your explanation?
The solution is too large to fit in the margins, eh? j/k, I know there’s a real paper. Should I go break my brain trying to read it, or wait for your explanation?
Oh, I definitely don’t have a better explanation of that in the works at this point.
Re: convergence—how real a problem is this? In your example you had two hypotheses that were precisely equally wrong. Does convergence still fail if the true probability is 0.500001?
My main concern here is the case where there’s an adversarial process taking advantage of the system, as in the trolling mathematicians work.
In the case of mathematical reasoning, though, the problem is quite severe. As is hopefully clear from what I’ve linked above: although an adversary can greatly exacerbate the problem, even a normal, non-adversarial stream of evidence is going to keep flipping the probability up and down by non-negligible amounts. (And although the post offers a solution, it’s a pretty dumb prior, and I argue all priors which avoid this problem will be similarly dumb.)
Re: calibration—I still believe that this can be included if you are jointly estimating your model and your hypothesis.
I don’t get this at all! What do you mean?
As you’ll hopefully agree at this point, we can always manufacture the 100% condition by turning it into virtual evidence.
As I discussed earlier, I agree, but with the major caveat that the likelihoods from the update aren’t found by first determining what you’re updating on, and then looking up the likelihoods for that; instead, we have to determine the likelihoods and then update on the virtual-evidence proposition with those likelihoods.
That’s not quite what I had in mind, but I can see how my ‘continuously valued’ comment might have thrown you off. A more concrete example might help: consider Example 2 in this paper. It posits three events:
[...]
What I’m saying is that you can just treat z as being an event itself, and do a Bayesian update from the likelihood [...]
Hmmmm. Unfortunately I’m not sure what to say to this one except that in logical induction, there’s not generally a pre-existing z we can update on like that.
Take the example of calibration traders. These guys can be described as moving probabilities up and down to account for the calibration curves so far. But that doesn’t really mean that you “update on the calibration trader” and e.g. move your 90% probabilities down to 89% in response. Instead, what happens is that the system takes a fixed-point, accounting for the calibration traders and also everything else, finding a point where all the various influences balance out. This point becomes the next overall belief distribution.
So the actual update is a fixed-point calculation, which isn’t at all a nice formula such as multiplying all the forces pushing probabilities in different directions (finding the fixed point isn’t even a continuous function).
We can make it into a Bayesian update on 100%-confident evidence by modeling it as virtual evidence, but the virtual evidence is sorta pulled from nowhere, just an arbitrary thing that gets us from the old belief state to the new one. The actual calculation of the new belief state is, as I said, the big fixed point operation.
You can’t even say that we’re updating on all those forces pulling the distribution in different directions, because there can be more than one fixed point of those forces. We don’t want to have uncertainty about which of those fixed points we end up in; that would give the wrong thing. So we really have to update on the fixed point itself, which is already the answer to what to update to; not some piece of information which we have pre-existing beliefs about and can work out likelihood ratios for in order to figure out what to update to.
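If it helps, here is a deliberately silly toy, just to convey the flavor (this is emphatically not the actual logical-induction machinery, and the particular “forces” and numbers are invented): each force says where it would push a candidate belief, and the next belief is a point where the pushes balance, found by search rather than by any prior-times-likelihood formula.

    # two made-up "forces" acting on a single probability p
    def calibration_force(p):
        # e.g. past high-confidence predictions came true a bit less often
        # than stated, so shade beliefs down slightly
        return 0.95 * p

    def pattern_force(p):
        # e.g. a pattern-spotter that wants the belief pulled toward 0.93
        return (p + 0.93) / 2

    def combined(p):
        # how forces get aggregated is itself part of the machinery;
        # here, just an average
        return (calibration_force(p) + pattern_force(p)) / 2

    # the next belief is an (approximate) fixed point of the combined map,
    # found by brute-force search over a grid
    grid = [i / 10000 for i in range(10001)]
    p_next = min(grid, key=lambda p: abs(combined(p) - p))
    print(p_next)  # about 0.8455 with these toy forces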
So that’s my real crux, and any examples with telephone calls and earthquakes etc are merely illustrative for me. (Like I said, I don’t know how to actually motivate any of this stuff except with actual logical uncertainty, and I’m surprised that any philosophers would have become convinced just from other sorts of examples.)
Hmmmm. Unfortunately I’m not sure what to say to this one except that in logical induction, there’s not generally a pre-existing z we can update on like that.
So that’s my real crux, and any examples with telephone calls and earthquakes etc are merely illustrative for me. (Like I said, I don’t know how to actually motivate any of this stuff except with actual logical uncertainty, and I’m surprised that any philosophers would have become convinced just from other sorts of examples.)
I agree that the logical induction case is different, since it’s hard to conceive of likelihoods to begin with. Basically, logical induction doesn’t even include what I would call virtual evidence. But many of the examples you gave do have such a z. I think I agree with your crux, and my main critique here is just of the examples of an overly dogmatic Bayesian who refuses to acknowledge the difference between a and z. I won’t belabor the point further.
I’ve thought of another motivating example, BTW. In wartime, your enemy deliberately sends you some verifiably true information about their force dispositions. How should you update on that? You can’t use a Bayesian update, since you don’t actually have a likelihood model available. We can’t even attempt to learn a model from the information, since we can’t be sure it’s representative.
I don’t get this at all! What do you mean?
By model M, I mean an algorithm that generates likelihood functions, so M(H,Z) = P(Z|H).
So any time we talk about a likelihood P(Z|H), it should really read P(Z|H,M). We’ll posit that P(H,M) = P(H)P(M) (i.e. that the model says nothing about our priors), but this isn’t strictly necessary.
E(P(Z|H,M)) will be higher for a well calibrated model than a poorly calibrated model, which means that we expect P(H,M|Z) to also be higher. When we then marginalize over the models to get a final posterior on the hypothesis P(H|Z), it will be dominated by the well-calibrated models: P(H|Z) = SUM_i P(H|M_i,Z)P(M_i|Z).
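A minimal sketch of what I have in mind (the models, priors, and data below are all invented for illustration): two candidate models generate likelihoods for a binary hypothesis, and after a handful of observations the posterior weight over models concentrates on the better-calibrated one, which then dominates the marginal P(H|Z).

    # two made-up models, each mapping a hypothesis to P(z=1 | H, M)
    models = {
        "well_calibrated": lambda h: 0.8 if h else 0.3,
        "overconfident":   lambda h: 0.99 if h else 0.01,
    }
    prior_H = {True: 0.5, False: 0.5}
    prior_M = {"well_calibrated": 0.5, "overconfident": 0.5}

    # invented observations, roughly what P(z=1 | H=True) = 0.8 would produce
    data = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]

    # joint posterior P(H, M | Z), up to normalization
    joint = {}
    for m_name, p_one in models.items():
        for h in (True, False):
            like = 1.0
            for z in data:
                q = p_one(h)
                like *= q if z == 1 else 1 - q
            joint[(h, m_name)] = prior_H[h] * prior_M[m_name] * like
    total = sum(joint.values())

    # marginalizing over models: P(H|Z) = sum_i P(H, M_i | Z)
    P_H_true = sum(v for (h, m), v in joint.items() if h) / total
    P_well_cal = sum(v for (h, m), v in joint.items()
                     if m == "well_calibrated") / total
    # with these numbers the well-calibrated model ends up with ~99.9%
    # of the posterior weight, and P(H=True | Z) comes out around 0.96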
BTW, I had a chance to read part of the ILA paper. It barely broke my brain at all! I wonder if the trick of enumerating traders and incorporating them over time could be repurposed to a more Bayesianish context, by instead enumerating models M. Like the trading firm in ILA, a meta-Bayesian algorithm could keep introducing new models M_k over time, with some intuition that the calibration of the best model in the set would improve over time, perhaps giving it all those nice anti-Dutch-book properties. Basically this is a computable Solomonoff induction that slowly approaches completeness in the limit. (I’m pretty sure this is not an original idea. I wouldn’t be surprised if something like this contributed to the ILA itself.)
Of course, it’s pretty unclear how this would work in the logical induction case. This might all be better explained in its own post.
I definitely missed a few things on the first read through—thanks for repeating the ratio argument in your response.
Oh, just FYI, I inserted that into the main post after writing about it in my response to you!
Phew! Thanks for de-gaslighting me.