This seems extremely pertinent for LW: a paper by Andrew Gelman and Cosma Shalizi. Abstract:
A substantial school in the philosophy of science identifies Bayesian inference with inductive inference and even rationality as such, and seems to be strengthened by the rise and practical success of Bayesian statistics. We argue that the most successful forms of Bayesian statistics do not actually support that particular philosophy but rather accord much better with sophisticated forms of hypothetico-deductivism. We examine the actual role played by prior distributions in Bayesian models, and the crucial aspects of model checking and model revision, which fall outside the scope of Bayesian confirmation theory. We draw on the literature on the consistency of Bayesian updating and also on our experience of applied work in social science.
I’m still reading it so I don’t have anything to say about it, and I’m not very statistics-savvy so I doubt I’ll have much to say about it after I read it, but I thought others here would find it an interesting read.
I stole this from a post by mjgeddes over in the OB open thread for July (Aside: mjgeddes, why all the hate? Where's the love, brotha?)

steven0461 already posted this to the previous Open Thread and we had a nice little talk.
I wrote a backlink to here from OB. I am not yet expert enough to evaluate this myself, but I do think mjgeddes raises an important and interesting question. As an active (if low-level) rationalist, I think it is important to follow, at least to some extent, what philosophers of science have actually worked out about how we can obtain reasonably reliable knowledge. The dominant account of how science proceeds seems to be the hypothetico-deductive model, described somewhat informally. No formalised model of the scientific process has so far been able to withstand serious criticism from the philosophy of science community. "Bayesianism" seems to be a serious candidate for such a formalised model, but it apparently still needs further development before it can answer all of that criticism. The recent article by Gelman and Shalizi is of course just the latest in a tradition of Bayes-critique; a classic is Glymour's "Why I am Not a Bayesian" (also in Gelman and Shalizi's reference list). That is from 1980, so a lot has probably happened since then. I am not up to date with most of this development myself, but it seems an important topic to discuss here on Less Wrong, which is quite Bayesian in orientation.
ETA: Never mind. I got my crackpots confused.

Original text was:

mjgeddes was once publicly dissed by Eliezer Yudkowsky on OB (can't find the link now, but it was a pretty harsh display of contempt). Since then, he has often bashed Bayesian induction, presumably in an effort to undercut EY's world view and thereby hurt EY as badly as he himself was hurt.

You're probably not thinking of this: On Geddes.

No, not that. Geddes made a comment on OB about eating a meal with EY during which he made some well-meaning remark about EY becoming more like Geddes as EY grows older, and noticing an expression of contempt (if memory serves) on EY's face. EY's reply on OB made it clear that he had zero esteem for Geddes.

Nope, that was Jef Allbright.

No wonder I couldn't find the link. Yeesh. One of these days I'll learn to notice when I'm confused.
I'm not expert enough to interpret. But I know Shalizi is skeptical of Bayesians and some of his blog posts seem so directly targeted at the LessWrong point of view that I almost suspect he's read this stuff. Getting in contact with him would be a coup.

(Fixed) link to earlier discussion of this paper in the last open thread.

(Edit—that's what I get for posting in this thread without refreshing the page. cousin_it already linked it.)
Yesterday, I posted my thoughts in last month’s thread on the article. I’m reproducing them here since this is where the discussion is at:
[cousin_it summarizing Gelman’s position] See, after locating the hypothesis, we can run some simple statistical checks on the hypothesis and the data to see if our prior was wrong. For example, plot the data as a histogram, and plot the hypothesis as another histogram, and if there’s a lot of data and the two histograms are wildly different, we know almost for certain that the prior was wrong. As a responsible scientist, I’d do this kind of check. The catch is, a perfect Bayesian wouldn’t. The question is, why?
Model checking is completely compatible with “perfect Bayesianism.” In the practice of Bayesian statistics, how often is the prior distribution you use exactly the same as your actual prior distribution? The answer is never. Really, do you think your actual prior follows a gamma distribution exactly? The prior distribution you use in the computation is a model of your actual prior distribution. It’s a map of your current map. With this in mind, model checking is an extremely handy way to make sure that your model of your prior is reasonable.
However, a discrepancy between the data and a simulation from your model doesn't necessarily mean that you have an unreasonable model of your prior. You could just have really wrong priors. So you have to think about what's going on to be sure. This does somewhat limit the role of model checking relative to what Gelman is pushing.
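For lurkers who want to see what this kind of check looks like in practice, here is a minimal sketch (mine, not from the paper or the thread) using an invented conjugate gamma-Poisson model and made-up data: fit the posterior, simulate replicated data sets, and compare the observed histogram with the spread of simulated histograms. All the numbers are placeholders chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up "observed" counts; in practice this would be your real data.
data = rng.poisson(lam=7.0, size=200)

# Model of the prior: rate ~ Gamma(shape=a, rate=b).  These numbers are
# themselves a model of the prior (a map of the map), not the "true" prior.
a, b = 2.0, 0.5

# Conjugate update for a Poisson likelihood.
a_post = a + data.sum()
b_post = b + len(data)

# Posterior predictive simulation: draw a rate, then a replicated data set.
n_rep = 1000
rep = np.empty((n_rep, len(data)), dtype=int)
for i in range(n_rep):
    lam = rng.gamma(shape=a_post, scale=1.0 / b_post)
    rep[i] = rng.poisson(lam=lam, size=len(data))

# Crude histogram check: flag bins where the observed counts fall outside
# the central 95% band of the replicated counts.
bins = np.arange(0, data.max() + 2)
obs_counts, _ = np.histogram(data, bins=bins)
rep_counts = np.stack([np.histogram(r, bins=bins)[0] for r in rep])
lo, hi = np.percentile(rep_counts, [2.5, 97.5], axis=0)
suspect = (obs_counts < lo) | (obs_counts > hi)
print("suspect bins:", bins[:-1][suspect])
# Since this fake data really was generated from the assumed family,
# few or no bins should be flagged; with real data, flagged bins are the
# "wildly different histograms" signal described above.
```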
After-the-fact model checking is completely incompatible with perfect Bayesianism, if we define perfect Bayesianism as:
Define a model with some parameters.
Pick a prior over the parameters.
Collect evidence.
Calculate the likelihood using the evidence and model.
Calculate the posterior by multiplying the prior by the likelihood.
When new evidence comes in, set the prior to the posterior and go to step 4.
There’s no step for checking if you should reject the model; there’s no provision here for deciding if you ‘just have really wrong priors.’ In practice, of course, we often do check to see if the model makes sense in light of new evidence, but then I wouldn’t think we’re operating like perfect Bayesians any more. I would expect a perfect Bayesian to operate according to the Cox-Jaynes-Yudkowsky way of thinking, which (if I understand them right) has no provision for model checking, only for updating according to the prior (or previous posterior) and likelihood.
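A minimal sketch of that six-step loop, assuming for illustration a single coin-flip (Bernoulli) model with the bias parameter on a discrete grid; the prior and the data batches are invented. The point to notice is that nothing in the loop ever asks whether the Bernoulli model itself is wrong.

```python
import numpy as np

# Step 1: model -- flips are i.i.d. Bernoulli(theta), with theta on a grid.
theta = np.linspace(0.01, 0.99, 99)

# Step 2: prior over the parameter (uniform here, purely for illustration).
prior = np.full_like(theta, 1.0 / len(theta))

def update(belief, flips):
    """Steps 4-5: multiply the current belief by the likelihood, renormalise."""
    heads = sum(flips)
    tails = len(flips) - heads
    likelihood = theta ** heads * (1.0 - theta) ** tails
    posterior = belief * likelihood
    return posterior / posterior.sum()

# Steps 3 and 6: evidence arrives in batches; the posterior becomes the new prior.
batches = [[1, 1, 0, 1], [1, 1, 1, 0, 1], [1, 0, 1, 1]]
belief = prior
for flips in batches:
    belief = update(belief, flips)

print("posterior mean of theta:", float(np.sum(theta * belief)))
# Note: no step ever asks whether the Bernoulli model itself should be rejected.
```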
My implicit definition of perfect Bayesian is characterized by these propositions:
There is a correct prior probability (as in, before you see any evidence, e.g. Occam priors) for every proposition
Given a particular set of evidence, there is a correct posterior probability for any proposition
If we knew exactly what our priors were and how to exactly calculate our posteriors, then your steps 1-6 are exactly how we should operate. There's no model checking because there is no model. The problem is, we don't know these things. In practice we can't exactly calculate our posteriors or precisely articulate our priors. So to approximate the correct posterior probability, we model our uncertainty about the proposition(s) in question. This includes every part of the model—the prior and the sampling model in the simplest case.
The rationale for model checking should be pretty clear at this point. How do we know if we have a good model of our uncertainty (or a good map of our map, to say it a different way)? One method is model checking. To forbid model checking when we know that we are modeling our uncertainty seems to be restricting the methods we can use to approximate our posteriors for no good reason.
Now I don't necessarily think that Cox, Jaynes, Yudkowsky, or any other famous Bayesian agrees with me here. But when we got to model checking in my Bayes class, I spent a few days wondering how it squared with the Bayesian philosophy of induction, and then what I took to be the obvious answer came to me (while discussing it with my professor actually): we're modeling our uncertainty. Just like we check our models of physics to see if they correspond to what we are trying to describe (reality), we should check our models of our uncertainty to see if they correspond to what we are trying to describe.
I would be interested to hear EY’s position on this issue though.
My implicit definition of perfect Bayesian is characterized by these propositions:
There is a correct prior probability (as in, before you see any evidence, e.g. Occam priors) for every proposition
Given a particular set of evidence, there is a correct posterior probability for any proposition
OK, this is interesting: I think our ideas of perfect Bayesians might be quite different. I agree that #1 is part of how a perfect Bayesian thinks, if by ‘a correct prior...before you see any evidence’ you have the maximum entropy prior in mind.
I’m less sure what ‘correct posterior’ means in #2. Am I right to interpret it as saying that given a prior and a particular set of evidence for some empirical question, all perfect Bayesians should get the same posterior probability distribution after updating the prior with the evidence?
If we knew exactly what our priors were and how to exactly calculate our posteriors, then your steps 1-6 are exactly how we should operate. There's no model checking because there is no model.
There has to be a model because the model is what we use to calculate likelihoods.
The rationale for model checking should be pretty clear …
Agree with this whole paragraph. I am in favor of model checking; my beef is with (what I understand to be) Perfect Bayesianism, which doesn’t seem to include a step for stepping outside the current model and checking that the model itself—and not just the parameter values—makes sense in light of new data.
I spent a few days wondering how it squared with the Bayesian philosophy of induction, and then what I took to be the obvious answer came to me (while discussing it with my professor actually): we're modeling our uncertainty.
The catch here (if I’m interpreting Gelman and Shalizi correctly) is that building a sub-model of our uncertainty into our model isn’t good enough if that sub-model gets blindsided with unmodeled uncertainty that can’t be accounted for just by juggling probability density around in our parameter space.* From page 8 of their preprint:
If nothing else, our own experience suggests that however many different specifications we think of, there are always others which had not occurred to us, but cannot be immediately dismissed a priori, if only because they can be seen as alternative approximations to the ones we made. Yet the Bayesian agent is required to start with a prior distribution whose support covers all alternatives that could be considered.
* This must be one of the most dense/opaque sentences I’ve posted on Less Wrong. If anyone cares enough about this comment to want me to try and break down what it means with an example, I can give that a shot.
OK, this is interesting: I think our ideas of perfect Bayesians might be quite different.
They most certainly are. But it’s semantics.
I agree that #1 is part of how a perfect Bayesian thinks, if by ‘a correct prior...before you see any evidence’ you have the maximum entropy prior in mind.
Frankly, I'm not informed enough about priors to commit to maxent, Kolmogorov complexity, or anything else.
I’m less sure what ‘correct posterior’ means in #2. Am I right to interpret it as saying that given a prior and a particular set of evidence for some empirical question, all perfect Bayesians should get the same posterior probability distribution after updating the prior with the evidence?
Yes.
There has to be a model because the model is what we use to calculate likelihoods.
aaahhh.… I changed the language of that sentence at least three times before settling on what you saw. Here’s what I probably should have posted (and what I was going to post until the last minute):
There’s no model checking because there is only one model—the correct model.
That is probably intuitively easier to grasp, but I think a bit inconsistent with my language in the rest of the post. The language is somewhat difficult here because our uncertainty is simultaneously a map and a territory.
The catch here (if I’m interpreting Gelman and Shalizi correctly) is that building a sub-model of our uncertainty into our model isn’t good enough if that sub-model gets blindsided with unmodeled uncertainty that can’t be accounted for just by juggling probability density around in our parameter space.*
For the record, I thought this sentence was perfectly clear. But I am a statistics grad student, so don’t consider me representative.
Are you asserting that this is a catch for my position? Or the "never look back" approach to priors? What you are saying seems to support my argument.
OK. I agree with that insofar as agents having the same prior entails them having the same model.
aaahhh.… I changed the language of that sentence at least three times before settling on what you saw. Here’s what I probably should have posted (and what I was going to post until the last minute):
There’s no model checking because there is only one model—the correct model.
That is probably intuitively easier to grasp, but I think a bit inconsistent with my language in the rest of the post. The language is somewhat difficult here because our uncertainty is simultaneously a map and a territory.
Ah, I think I get you; a PB (perfect Bayesian) doesn’t see a need to test their model because whatever specific proposition they’re investigating implies a particular correct model.
For the record, I thought this sentence was perfectly clear. But I am a statistics grad student, so don’t consider me representative.
Yeah, I figured you wouldn’t have trouble with it since you talked about taking classes in this stuff—that footnote was intended for any lurkers who might be reading this. (I expected quite a few lurkers to be reading this given how often the Gelman and Shalizi paper’s been linked here.)
Are you asserting that this is a catch for my position? Or the "never look back" approach to priors? What you are saying seems to support my argument.
It’s a catch for the latter, the PB. In reality most scientists typically don’t have a wholly unambiguous proposition worked out that they’re testing—or the proposition they are testing is actually not a good representation of the real situation.
I agree that #1 is part of how a perfect Bayesian thinks, if by ‘a correct prior...before you see any evidence’ you have the maximum entropy prior in mind.
Allow me to introduce to you the Brandeis dice problem. We have a six-sided die, sides marked 1 to 6, possibly unfair. We throw it many times (say, a billion) and obtain an average value of 3.5. Using that information alone, what’s your probability distribution for the next throw of the die? A naive application of the maxent approach says we should pick the distribution over {1,2,3,4,5,6} with mean 3.5 and maximum entropy, which is the uniform distribution; that is, the die is fair. But if we start with a prior over all possible six-sided dice and do Bayesian updating, we get a different answer that diverges from fairness more and more as the number of throws goes to infinity! The reason: a die that’s biased towards 3 and 4 makes a mean value of 3.5 even more likely than a fair die.
Does that mean you should give up your belief in maxent, your belief in Bayes, your belief in the existence of “perfect” priors for all problems, or something else? You decide.
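To make the maxent half of this concrete: under the standard result that the maximum-entropy distribution on {1,...,6} with a fixed mean has the form p_k proportional to exp(lam*k), the sketch below just solves for lam numerically. A mean of 3.5 recovers the uniform distribution; other means give tilted ones. This says nothing about the Bayesian-updating side of the comparison, which is the contested part.

```python
import numpy as np
from scipy.optimize import brentq

faces = np.arange(1, 7)

def maxent_die(mean):
    """Max-entropy distribution on {1..6} with the given mean: p_k proportional to exp(lam*k)."""
    def mean_gap(lam):
        w = np.exp(lam * faces)
        return np.dot(faces, w) / w.sum() - mean
    lam = brentq(mean_gap, -50.0, 50.0)  # mean_gap is increasing in lam
    p = np.exp(lam * faces)
    return p / p.sum()

print(np.round(maxent_die(3.5), 4))  # uniform: every face gets 1/6
print(np.round(maxent_die(4.5), 4))  # tilted towards the higher faces
```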
But if we start with a prior over all possible six-sided dice and do Bayesian updating, we get a different answer that diverges from fairness more and more as the number of throws goes to infinity!
In this example, what information are we Bayesian updating on?
But if we start with a prior over all possible six-sided dice and do Bayesian updating, we get a different answer that diverges from fairness more and more as the number of throws goes to infinity!
I’m nearly positive that the linked paper (and in particular, the above-quoted conclusion) is just wrong. Many years ago I checked the calculations carefully and found that the results come from an unavailable computer program, so it’s definitely possible that the results were just due to a bug. Meanwhile, my paper copy of PT:LOS contains a section which purports to show that Bayesian updating and maximum entropy give the same answer in the large-sample limit. I checked the math there too, and it seemed sound.
I might be able to offer more than my unsupported assertions when I get home from work.
I’ve checked carefully in PT:LOS for the section I thought I remembered, but I can’t find it. I distinctly remember the form of the theorem (it was a squeeze theorem), but I do not recall where I saw it. I think Jaynes was the author, so it might be in one of the papers listed here… or it could have been someone else entirely, or I could be misremembering. But I don’t think I’m misremembering, because I recall working through the proof and becoming satisfied that Uffink must have made a coding error.
We throw it many times (say, a billion) and obtain an average value of 3.5. Using that information alone
So my prior state of knowledge about the die is entirely characterized by N=10^9 and m=3.5, with no knowledge of the shape of the distribution? It’s not obvious to me how you’re supposed to turn that, plus your background knowledge about what sort of object a die is, into a prior distribution; even one that maximizes entropy. The linked article mentions a “constraint rule” which seems to be an additional thing.
This sort of thing is rather thoroughly covered by Jaynes in PT:TLOS as I recall, and could make a good exercise for the Book Club when we come to the relevant chapters. In particular section 10.3 “How to cheat at coin and die tossing” contains the following caveat:
The results of tossing a die many times do not tell us any definite number characteristic only of the die. They tell us also something about how the die was tossed. If you toss 'loaded' dice in different ways, you can easily alter the relative frequencies of the faces. With only slightly more difficulty, you can still do this if your dice are perfectly 'honest'.
And later:
The problems in which intuition compels us most strongly to a uniform probability assignment are not the ones in which we merely apply a principle of 'equal distribution of ignorance'. Thus, to explain the assignment of equal probabilities to heads and tails on the grounds that we 'saw no reason why either face should be more likely than the other', fails utterly to do justice to the reasoning involved. The point is that we have not merely 'equal ignorance'. We also have positive knowledge of the symmetry of the problem; and introspection will show that when this positive knowledge is lacking, so also is our intuitive compulsion toward a uniform distribution.
Hah. The dice example and the application of maxent to it comes originally from Jaynes himself, see page 4 of the linked paper.
I'll try to reformulate the problem without the constraint rule, to clear matters up or maybe confuse them even more. Imagine that, instead of you throwing the die a billion times and obtaining a mean of 3.5, a truthful deity told you that the mean was 3.5. First question: do you think the maxent solution in that case is valid, for some meaning of "valid"? Second question: why do you think it disagrees with Bayesian updating as you throw the die a huge number of times and learn only the mean? Is the information you receive somehow different in quality? Third question: which answer is actually correct, and what does "correct" mean here?

I think I'd answer, "the mean of what?" ;)
I’m not really qualified to comment on the methodological issues since I have yet to work through the formal meaning of “maximum entropy” approaches. What I know at this stage is the general argument for justifying priors, i.e. that they should in some manner reflect your actual state of knowledge (or uncertainty), rather than be tainted by preconceptions.
If you appeal to intuitions involving a particular physical object (a die) and simultaneously pick a particular mathematical object (the uniform prior) without making a solid case that the latter is our best representation of the former, I won't be overly surprised at some apparently absurd result.
It’s not clear to me for instance what we take a “possibly biased die” to be. Suppose I have a model that a cubic die is made biased by injecting a very small but very dense object at a particular (x,y,z) coordinate in a cubic volume. Now I can reason based on a prior distribution for (x,y,z) and what probability theory can possibly tell me about the posterior distribution, given a number of throws with a certain mean.
Now a six-sided die is normally symmetrical in such a way that 3 and 4 are on opposite sides, and I'm having trouble even seeing how a die could be biased "towards 3 and 4" under such conditions. Which means a prior that makes that outcome more likely than a fair die should probably be ruled out by our formalization—or we should also model our uncertainty over which faces have which numbers.
I’m having trouble even seeing how a die could be biased “towards 3 and 4” under such conditions.
If the die is slightly shorter along the 3-4 axis than along the 1-6 and 2-5 axes, then the 3 and 4 faces will have slightly greater surface area than the other faces.
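A toy calculation of that geometry (a 1 x 1 x 0.9 box, shorter along the 3-4 axis), treating face probability as simply proportional to face area; that proportionality is a crude stand-in for illustration, not a physical model of how such a die actually lands.

```python
# Toy example: a 1 x 1 x 0.9 box die, shorter along the 3-4 axis.
# Assigning probability proportional to face area is only a crude proxy.
areas = {
    1: 1.0 * 0.9, 6: 1.0 * 0.9,  # faces perpendicular to the 1-6 axis
    2: 1.0 * 0.9, 5: 1.0 * 0.9,  # faces perpendicular to the 2-5 axis
    3: 1.0 * 1.0, 4: 1.0 * 1.0,  # faces perpendicular to the shortened 3-4 axis
}
total = sum(areas.values())
print({face: round(area / total, 3) for face, area in areas.items()})
# The 3 and 4 faces end up slightly more probable than the rest.
```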
Our models differ, then: I was assuming a strictly cubic die. So maybe we should also model our uncertainty over the dimensions of the (parallelepipedic) die.
But it seems in any case that we are circling back to the question of model checking, via the requirement that we should first be clear about what our uncertainty is about.

Cyan, I was hoping you'd show up. What do you think about this whole mess?

I find myself at a loss to give a brief answer. Can you ask a more specific question?
In the large N limit, and only the information that the mean is exactly 3.5, the obvious conclusion is that one is in a thought experiment, because that’s an absurd thing to choose to measure and an adversary has chosen the result to make us regret the choice.
More generally, one should revisit the hypothesis that the rolls of the die are independent. Yes, rolling only 1 and 6 is more likely to get a mean of 3.5 than rolling all six numbers, but still quite unlikely. Model checking!
EDIT: I am an eejit. Dangit, need to remember to stop and think before posting.
Umm, not quite.
The die being biased towards 2 and 5 gives the same probability of 3.5 as the die being 3,4 biased.
As does 1,6 bias.
So, given these three possibilities, an equal distribution is once again shown to be correct. By picking one of the three, and ignoring the other two, you can (accidentally) trick some people, but you cannot trick probability.
This is before even looking at the maths, and/or asking about the precision to which the mean is given (i.e. is it 2 s.f., 13 s.f., 1 billion s.f.? Rounded to the nearest 0.5?)

EDIT: this appears to be incorrect, sorry.
Intuitively, I’d say that a die biased towards 1 and 6 makes hitting the mean (with some given precision) less likely than a die biased towards 3 and 4, because it spreads out the distribution wider. But you don’t have to take my word for it, see the linked paper for calculations.
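A quick, rough simulation of that intuition (the biased-die probabilities, number of rolls, and tolerance are all invented for illustration): estimate how often a fair die, a 3,4-biased die, and a 1,6-biased die, each with true mean 3.5, produce a sample mean within a small tolerance of 3.5.

```python
import numpy as np

rng = np.random.default_rng(1)
faces = np.arange(1, 7)

dice = {
    "fair":       np.full(6, 1 / 6),
    "3,4-biased": np.array([0.10, 0.10, 0.30, 0.30, 0.10, 0.10]),
    "1,6-biased": np.array([0.30, 0.10, 0.10, 0.10, 0.10, 0.30]),
}  # all three have true mean 3.5

n_rolls, n_trials, tol = 1000, 5000, 0.01
for name, p in dice.items():
    rolls = rng.choice(faces, size=(n_trials, n_rolls), p=p)
    sample_means = rolls.mean(axis=1)
    hit_rate = np.mean(np.abs(sample_means - 3.5) < tol)
    print(f"{name:11s} P(|sample mean - 3.5| < {tol}) ~ {hit_rate:.3f}")
```

With these particular numbers the 3,4-biased die hits the window most often and the 1,6-biased die least often, which is the direction of the intuition above; this is only a sanity check on that intuition, not the calculation in the linked paper.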
Ahk, brainfart, it DOES depend on accuracy. I was thinking of it as so heavily biased that the other results don’t come up, and having perfect accuracy (rather than rounded to: what?)
Sorry, please vote down my previous post slightly (negative reinforcement for reacting too fast)
Hopefully I’ll find information about the rounding in the paper.
Can anyone with more experience with Bayesian statistics than me evaluate this article?

EDIT: This is not an evaluation of the particular paper in question, merely some general evaluation guidelines which are useful.

A drop-dead-easy way to evaluate a paper without reading it (not a standard to live by, but it works):

1) Look up the authors. If they are professors or experts, great; if it's a nobody or a student, ignore and discard, or take it with a grain of salt.

2) Check whether the paper was published, and where. (If it's on arXiv, beware: it takes no real skill to get your work posted there; anyone can do it.)

Criteria: the paper was written by respectable authorities, or by people whose opinion can be trusted, or you have enough knowledge to filter for mistakes; and the paper was published in a quality journal, or, again, you have enough knowledge to filter.

If both conditions are met, I find you can do a good job of filtering out the papers not worth reading.

Apologies for being blunt, but your comment is nigh on useless: Andrew Gelman is a stats professor at Columbia who co-authored a book on Bayesian statistics (incidentally, he was also interviewed a while back by Eliezer on BHTV), while Cosma Shalizi is a stats professor at Carnegie Mellon who is somewhat well-known for his excellent Notebooks.

I don't fault you for not having known all of this, but this information was a few Google searches away. Your advice is clearly inapplicable in this case.

You're missing the point, which was not to evaluate that specific paper, but to provide some general heuristics for quickly evaluating a paper.

You have, as has been pointed out, failed to understand the purpose of my comment. You will notice I never stated anything about this paper, merely some basic guidelines to follow for determining whether a paper is worth the effort to read, if one doesn't have significant knowledge of the field within which it was written.

I apologize if my purpose was not clear, but your comment is completely irrelevant and misguided.

Also:

3) Check for grammar, spelling, capitalization, and punctuation.