Speaking for myself, this (combined with your orthodox case against utility functions) feels like the next biggest step for me since Embedded Agency in understanding what’s wrong with our models of agency and how to improve them.
If I were to put it into words, I’m getting a strong vibe of “No really, you’re starting the game inside the universe, stop assuming you’ve got all the hypotheses in your head and that you’ve got clean input-output, you need far fewer assumptions if you’re going to get around this space at all.” Plus a sense that this isn’t ‘weird’ or ‘impossibly confusing’, and that actually these things will be able to make good sense.
All the details though are in the things you say about convergence and not knowing your updates and so on, which I don’t have anything to add to.
I made notes while reading about things that I was confused about or that stood out to me. Here they are:
The post says that radical probabilism rejects #3-#5, but also that Jeffrey updating is derived from having rigidity (#5), which sounds like a contradiction. (I feel most dumb about this bullet, it’s probably obvious.)
The convergence section blew me away. The dialogue here correctly understood my confusion (why would I only believe either h(1/3) or h(2/3)) and then hit me with the ‘embedded world models’ point, and that was so persuasive. This felt really powerful, tying together some key arguments in this space.
I don’t get why the proof of conservation of expected evidence is relevant. It seems to assume that not only do I know how I will update, but that the bookie does too, which seems like an odd and overpowered assumption, and seems in tension with all the things you said about rigidity – why does the bookie get to know how I’ll update?
“This has some implications for AI alignment, but I won’t try to spell them out here.” Such temptation! :)
I didn’t follow the argument that classical bayesians don’t have calibration. I think it’s just saying that classical bayesianism doesn’t have any part for self-reference, and that’s a big deal? I don’t think this means bayesians aren’t calibrated, just that they don’t have calibration as an explicit part of their model.
I do not understand how Jeffrey updates lead to path dependence. Is the trick that my probabilities can change without evidence, therefore I can just update B without observing anything that also updates A, and then use that for hocus pocus? Writing that out, I think that’s probably it, but as I was reading the essay I wasn’t sure where the key step was happening.
Okay, I got tired and skipped most of the virtual evidence section (it got tough for me). You say “Exchange Virtual Evidence” and I would be interested in a concrete example of what that kind of conversation would look like. I’m imagining it’s something like “I thought for ages and changed my mind, let me tell you why”.
Thanks for the stuff at the end, about making the meta-bayesian update. I wanted to read you say your thoughts on that, would’ve been sad if it hadn’t been there.
The examples of non-bayesian updates I’ve been making are really valuable. I’ll be noticing these more often.
The post says that radical probabilism rejects #3-#5, but also that Jeffrey updating is derived from having rigidity (#5), which sounds like a contradiction.
Jeffrey doesn’t see Jeffrey updates as normative! Like Bayesian updates, they’re merely one possible way to update.
This is also part of why Pearl sounds like a critic of Jeffrey when in fact the two largely agree—you have to realize that Jeffrey isn’t advocating Jeffrey updating in a strong way, only using it as a kind of gateway drug to the more general fluid updates.
I don’t get why the proof of conservation of expected evidence is relevant. It seems to assume that not only do I know how I will update, but that the bookie does too, which seems like an odd and overpowered assumption, and seems in tension with all the things you said about rigidity – why does the bookie get to know how I’ll update?
Hmm. A proper reply would step through the argument more carefully (maybe later?). But in short: no, the argument doesn’t require either of those. It requires only that you have some expectation about your update, and that the bookie knows what that expectation is (which is pretty standard, because in Dutch book arguments the bookies generally have access to your beliefs). You might have a very broad distribution over your possible updates, but there will still be an expected value, which is what’s used in the argument.
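If it helps to see the role the expectation plays, here’s a toy sketch (the numbers are mine, and this is only the expected-profit version, not the full sure-loss construction):

```python
# Toy sketch: the bookie only needs your *expected* future price E[P1(A)],
# not knowledge of which update you will actually make.
P0 = 0.5                    # your current price for a $1 contract on A
future_prices = [0.4, 0.9]  # you may update in very different directions...
chances = [0.5, 0.5]        # ...but your distribution still has an expectation

E_P1 = sum(p * c for p, c in zip(future_prices, chances))  # 0.65

# The bookie buys the contract from you now at P0 and sells it back
# at whatever your new price turns out to be. On average:
expected_profit = E_P1 - P0
print(round(expected_profit, 2))  # 0.15; profitable whenever E[P1] != P0
```

(If E[P1] were below P0 instead, the bookie would just run the same trades in the other direction.)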
I didn’t follow the argument that classical bayesians don’t have calibration. I think it’s just saying that classical bayesianism doesn’t have any part for self-reference, and that’s a big deal? I don’t think this means bayesians aren’t calibrated, just that they don’t have calibration as an explicit part of their model.
Like convergence, this is dependent on the prior, so I can’t say that classical Bayesians are never calibrated (although one could potentially prove some pretty strong negative results, as is the case with convergence?). I didn’t really include any argument, I just stated it as a fact.
What I can say is that classical Bayesianism doesn’t give you tools for getting calibrated. How do you construct a prior so that it’ll have a calibration property wrt learning? Classical Bayesianism doesn’t, to my knowledge, talk about this. Hence, by default, I expect most priors to be miscalibrated in practice when grain-of-truth (realizability) doesn’t hold.
For example, I’m not sure whether Solomonoff induction has a calibration property—nor whether it has a convergence property. These strike me as mathematically complex questions. What I do know is that the usual path to prove nice properties for Solomonoff induction doesn’t let you prove either of these things. (IE, we can’t just say “there’s a program in the mixture that’s calibrated/convergent, so....” … whereas logical induction lets you argue calibration and convergence via the relatively simple “there are traders which enforce these properties”)
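(For concreteness, the calibration property at stake can be phrased empirically: among claims assigned probability near p, roughly a fraction p should turn out true. A minimal checker, with toy data of my own:)

```python
# Minimal sketch of an empirical calibration check (toy data mine):
# among predictions stated with probability ~p, the frequency of truth
# should tend toward p for a calibrated reasoner.
from collections import defaultdict

def calibration_table(predictions):
    """predictions: list of (stated_probability, outcome) pairs."""
    buckets = defaultdict(list)
    for p, outcome in predictions:
        buckets[round(p, 1)].append(outcome)  # bucket to the nearest 0.1
    return {b: sum(o) / len(o) for b, o in sorted(buckets.items())}

preds = [(0.7, True), (0.7, True), (0.7, False),
         (0.3, False), (0.3, True), (0.3, False)]
print(calibration_table(preds))  # {0.3: 0.33..., 0.7: 0.66...} -- roughly calibrated
```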
Thank you, those points all helped a bunch.
(I feel most resolved on the calibration one. If I think more about the other two and have more questions, I’ll come back and write them.)
I do not understand how Jeffrey updates lead to path dependence. Is the trick that my probabilities can change without evidence, therefore I can just update B without observing anything that also updates A, and then use that for hocus pocus? Writing that out, I think that’s probably it, but as I was reading the essay I wasn’t sure where the key step was happening.
hmmmm. My attempt at an English translation of my example:
A and B are correlated, so moving B to 60% (up from 50%) makes A more probable as well. But then moving A up to 60% is less of a move for A. This means that (A&¬B) ends up smaller than (B&¬A): each gets dragged up by one update and down by the other, but (B&¬A) was dragged up by the larger update (the move on B) and down by the smaller (the move on A), while (A&¬B) got the reverse treatment.
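To make that concrete, here’s a quick numeric check (the moves to 60% are from the example above; the particular correlated prior is my own choice):

```python
import numpy as np

# Correlated prior over (A, B): rows index A, columns index B.
# P(A) = P(B) = 0.5 and P(A&B) = 0.4, so A and B are correlated.
prior = np.array([[0.4, 0.1],   # ¬A&¬B, ¬A&B
                  [0.1, 0.4]])  #  A&¬B,  A&B

def jeffrey_B(joint, new_pB):
    """Jeffrey update moving P(B) to new_pB: rescale the B and ¬B halves
    so the conditionals given B and given ¬B are unchanged (rigidity)."""
    pB = joint[:, 1].sum()
    out = joint.copy()
    out[:, 1] *= new_pB / pB
    out[:, 0] *= (1 - new_pB) / (1 - pB)
    return out

def jeffrey_A(joint, new_pA):
    """The same thing for the partition {A, ¬A}."""
    pA = joint[1, :].sum()
    out = joint.copy()
    out[1, :] *= new_pA / pA
    out[0, :] *= (1 - new_pA) / (1 - pA)
    return out

b_then_a = jeffrey_A(jeffrey_B(prior, 0.6), 0.6)
a_then_b = jeffrey_B(jeffrey_A(prior, 0.6), 0.6)

print(b_then_a[1, 0].round(4), b_then_a[0, 1].round(4))  # P(A&¬B)=0.0857, P(B&¬A)=0.1091
print(a_then_b[1, 0].round(4), a_then_b[0, 1].round(4))  # P(A&¬B)=0.1091, P(B&¬A)=0.0857
```

Same two updates, opposite orders, genuinely different posteriors: that’s the path dependence.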
Okay, I got tired and skipped most of the virtual evidence section (it got tough for me). You say “Exchange Virtual Evidence” and I would be interested in a concrete example of what that kind of conversation would look like.
It would be nice to write a whole post on this, but the first thing you need to do is distinguish between likelihoods and probabilities.
likelihood(A|B)=probability(B|A)
The notation may look pointless at first. The main usage has to do with the way we usually regard the first argument as variable and the second as fixed. IE, “a probability function sums to one” can be understood as P(A|B)+P(¬A|B)=1; we more readily think of A as variable here. In a Bayesian update, we vary the hypothesis, not the evidence, so it’s more natural to think in terms of a likelihood function, L(H|E).
In a Bayesian network, you propagate probability functions down links, and likelihood functions up links. Hence Pearl distinguished between the two strongly.
Likelihood functions don’t sum to 1. Think of them as fragments of belief which aren’t meaningful on their own until they’re combined with a probability.
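(A tiny illustration of that last point, with made-up numbers:)

```python
# Toy numbers of my own: a likelihood function over hypotheses need not
# sum to 1; it becomes a belief only once it's combined with a prior
# and renormalized.
prior      = {"H1": 0.5, "H2": 0.5}
likelihood = {"H1": 0.9, "H2": 0.3}  # L(H|E) = P(E|H); sums to 1.2, which is fine

unnormalized = {h: prior[h] * likelihood[h] for h in prior}
Z = sum(unnormalized.values())       # the normalizing constant, P(E)
posterior = {h: round(p / Z, 3) for h, p in unnormalized.items()}
print(posterior)                     # {'H1': 0.75, 'H2': 0.25}
```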
Base-rate neglect can be thought of as confusion of likelihood for probability. The conjunction fallacy could also be explained in this way.
I wish it were feasible to get people to use “likely” vs “probable” in this way. Sadly, that’s unprobable to work.
I’m imagining it’s something like “I thought for ages and changed my mind, let me tell you why”.
What I’m pointing at is really much more outside-view than that. Standard warnings about outside view apply. ;p
An example of exchanging probabilities is: I assert X, and another person agrees. I now know that they assign a high probability to X. But that does not tell me very much about how to update.
Exchanging likelihoods instead: I assert X, and the other person tells me they already thought that for unrelated reasons. This tells me that their agreement is further evidence for X, and I should update up.
Or, a different possibility: I assert X, and the other person updates to X, and tells me so. This doesn’t provide me with further evidence in favor of X, except insofar as they acted as a proof-checker for my argument.
“Exchange virtual evidence” just means “communicate likelihoods” (or just likelihood ratios!)
Exchanging likelihoods is better than exchanging probabilities, because likelihoods are much easier to update on.
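(A sketch of what “easier to update on” cashes out to, with hypothetical numbers: in odds form, a communicated likelihood ratio is a single multiplication, with no need to untangle the speaker’s prior.)

```python
# Made-up numbers: updating on a communicated likelihood ratio is one
# multiplication in odds form. Updating on someone's *posterior
# probability* would instead require knowing and dividing out their prior.
def update_on_likelihood_ratio(my_prob, their_ratio):
    """their_ratio = P(their evidence | X) / P(their evidence | ¬X),
    assuming their evidence is independent of mine."""
    prior_odds = my_prob / (1 - my_prob)
    posterior_odds = prior_odds * their_ratio
    return posterior_odds / (1 + posterior_odds)

# I'm at 70% on X; they report independent evidence favoring X 3:1.
print(round(update_on_likelihood_ratio(0.7, 3.0), 3))  # 0.875
```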
Granted, exchanging models is much better than either of those two ;3 However, it’s not always feasible. There are quick conversational examples like the ones I gave, where someone may just want to express their epistemic state wrt what you just said without significantly interrupting the flow of conversation. But we could also be in a position where we’re trying to integrate many expert opinions in a forecasting-like setting. If we can’t build a coherent model to fit all the information together, virtual evidence is probable to be one of the more practical and effective ways to go.
Thank you, they were all helpful. I’ll write more if I have more questions.
(“sadly that’s unprobable to work” lol)
I do not understand how Jeffrey updates lead to path dependence. Is the trick that my probabilities can change without evidence, therefore I can just update B without observing anything that also updates A, and then use that for hocus pocus? Writing that out, I think that’s probably it, but as I was reading the essay I wasn’t sure where the key step was happening.
TL;DR: Based on Radical Probabilism and Bayesian Conditioning (pages 4 and 5), the path depends on the order evidence is received in, but the destination does not.
From the text itself, the “issue” is mentioned:
An attractive feature of Jeffrey’s kinematics is that it allows one to be a fallibilist about evidence and yet still make use of it. An apparent sighting of one’s friend across the street, for instance, can be revised subsequently when you are told that he is out of the country. A closely related feature is the order-dependence of Jeffrey conditioning: conditioning on a particular redistribution of probability over a partition {Ai} and then on a redistribution of probability over another partition {Bi} will not in general yield the same posterior probability as conditioning first on the redistribution over {Bi} and then on that over {Ai}. This property, in contrast to the first, has been a matter of concern rather than admiration; a concern for the most part based on a confusion between the experience or evidence and its effect on the mind of the agent.
And explained:
Suppose, for instance, that I expect an essay from a student. I arrive at work to find an unnamed essay in my pigeonhole with familiar writing. I am 90% sure that it is from the student in question. But then I find that he left me a message the day before saying that he thinks that he may well not be able to bring me the essay in the next couple of days. In the light of all that I have learnt, I now lower to 30% my probability that the essay was from him. Suppose now I got the message before the essay. The final outcome should be the same, but I will get there a different way: perhaps by my probabilities for the essay coming from him initially going to 10% and then rising to 30% on finding the essay. The important thing is this reversal of the order of experience does not produce a reversal of the order of the probabilities: I do not think it 30% likely that I will get the essay after hearing the message and then revise it to 90% after checking my pigeonhole. The same experiences have different effects on my probabilities depending on the order in which they occur. (This is, of course, just a particular application of the rule that my posteriors depend both on the priors and the inputs).
Sh*t. Wow. This is really impressive.