The Bayesian using the universal prior is cheating. Kolmogorov complexity is incomputable. So you have to use a realistic compressor. Then the optimality results go away. But in practice, if your compressor is good, it’s almost always going to outdo the frequentist.
Yep, the universal prior should be filed under “science fiction”. And you know why it’s uncomputable? Because for any computable prior I can create a coin-producing machine that will anticipate and thwart the Bayesian’s expectations :-)
The frequentist approach is analogous to a minimax strategy in a game: no matter how malicious the universe is, you still get your 90%. Other, non-minimax strategies inevitably take risks to try and exploit stupid opponents. This connection was formalized by Abraham Wald, who was mentioned by Cyan in the two posts that started this whole affair.
Replying to Wei Dai’s post, I still don’t know which magic is the stronger one or what “stronger” actually means here. In actual statistical practice, choosing good priors clearly requires skills and techniques that aren’t part of the naive Bayesian canon. This creates a doubt in my mind that just won’t go away.
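To spell out the guarantee being invoked (assuming the setup from Wei Dai’s post as described later in this thread: each coin is either 90%-heads or 10%-heads, and you see a single flip), the rule “guess the bias that matches the flip you saw” succeeds with probability 0.9 under either hypothesis:

$$\Pr(\text{correct} \mid \theta = 0.9) = \Pr(\text{heads} \mid \theta = 0.9) = 0.9, \qquad \Pr(\text{correct} \mid \theta = 0.1) = \Pr(\text{tails} \mid \theta = 0.1) = 0.9.$$

Since the bound holds under each hypothesis separately, it holds for any mixture of coins the universe chooses to serve up, which is exactly the minimax flavor of the claim.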
The frequentist approach is analogous to a minimax strategy in a game: no matter how malicious the universe is, you still get your 90%. Other, non-minimax strategies inevitably take risks to try and exploit stupid opponents.
If ‘the universe is malicious’ is the right prior to use then a Bayesian will use it.
Using a universal prior on a malicious universe gives a score that differs from the frequentist’s by at most a constant, even over infinite time.
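As a sketch of why (this is the standard Solomonoff dominance bound, not anything specific to this thread’s coin machine): the universal mixture $M$ satisfies $M(x) \ge 2^{-K(\mu)}\,\mu(x)$ for every computable distribution $\mu$, so

$$-\log_2 M(x_{1:n}) \;\le\; -\log_2 \mu(x_{1:n}) + K(\mu) \quad \text{for all } n.$$

The cumulative log-loss of the universal Bayesian is therefore within a constant (the description length of $\mu$, including the frequentist’s prediction rule viewed as a computable measure) of the best computable predictor, no matter how long the game runs.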
Any sane computable approximation of the universal prior would necessarily include a weight on the hypothesis ‘the universe is more evil/complex than can be represented with algorithms of the length that this approximation handles’.
If the universe is malicious you probably shouldn’t be doing statistics on it.
In actual statistical practice, choosing good priors clearly requires skills and techniques that aren’t part of the naive Bayesian canon.
This is true. That is when it is time to use frequentist approximations.
Frequentist techniques can be useful in certain situations, but only because of the great difficulty in assigning accurate priors and the fact that we often have such overwhelmingly large amounts of evidence that any hypothesis with a substantial prior will probably end up with a posterior near zero or near one if proper Bayesian reasoning is used. In those situations, judging the probability of a hypothesis by its p-value saves a lot of complicated work and is almost as good. However, when there is not overwhelming evidence, or when the evidence points to a hypothesis with a negligible prior, frequentist statistics ceases to provide an adequate approximation. Always remember to think like a Bayesian, even when using frequentist methods.
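A toy numerical sketch of the swamping point (the numbers here are invented purely for illustration):

```python
def posterior(prior, likelihood_ratio):
    """Posterior P(H | E) given P(H) = prior and likelihood ratio
    P(E | H) / P(E | not-H) = likelihood_ratio, via Bayes in odds form."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# With overwhelming evidence, very different priors land in the same place,
# which is why the p-value shortcut is usually harmless:
print(posterior(0.01, 1e6))  # ~0.9999
print(posterior(0.50, 1e6))  # ~0.999999

# With weak evidence or a negligible prior, the prior still dominates and
# the shortcut breaks down:
print(posterior(1e-9, 1e3))  # ~1e-6
```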
Also, this was a terrible example of frequentist statistics avoiding the use of priors, because the prior was that the probability that a coin would land heads anything other than 90% or 10% of the time is zero. This is a tremendous assertion, and refusing to specify the probabilities of 90% heads vs. 10% heads just makes the prior incomplete. Saying that there is no prior is like saying that I did not make this post just because the last sentence lacks a period
because the prior was that the probability that a coin would land heads anything other than 90% or 10% of the time is zero. This is a tremendous assertion, and refusing to specify the probabilities of 90% heads vs. 10% heads just makes the prior incomplete
It has wings that are way too big for a mammal, and it lays eggs! Which just makes it an incomplete mammal.
It’s really more than that. Consider the great anti-Bayesian Cosma Shalizi. He’s shown that the use of a prior is really equivalent to a method of smoothing, of regularization on your hypothesis space, trading off (frequentist) bias and variance. And everyone has a regularization scheme, even if they claim it is “don’t: ‘let the data speak for themselves’”.
Your assertion that the machine is exactly 90% biased is exactly equivalent to an evenly-split point-mass prior at 10% and 90%. A Bayesian with that prior would exactly reproduce your reasoning. You’re absolutely right that it is in no way an incomplete prior: it exactly specifies everything. One could consider it an overconfident prior, but if you do in fact know that the bias is either 10% or 90%, and have no idea which, it’s perfectly appropriate. The choice of which frequentist statistical techniques to use is pretty much isomorphic to the choice of Bayesian priors.
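A minimal sketch of that equivalence, assuming the single-flip setup from the post (the function name and structure are just for illustration):

```python
# Evenly-split point-mass prior over the two admissible biases.
PRIOR = {0.9: 0.5, 0.1: 0.5}

def posterior_after_one_flip(heads):
    """Posterior over the coin's heads-bias after observing a single flip."""
    likelihood = {b: (b if heads else 1 - b) for b in PRIOR}
    normaliser = sum(PRIOR[b] * likelihood[b] for b in PRIOR)
    return {b: PRIOR[b] * likelihood[b] / normaliser for b in PRIOR}

print(posterior_after_one_flip(True))   # {0.9: 0.9, 0.1: 0.1}
print(posterior_after_one_flip(False))  # {0.9: 0.1, 0.1: 0.9}
```

The Bayesian bets on whichever bias matches the observed flip and is right 90% of the time, which is the same rule and the same number the frequentist analysis produces.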
EDIT: On the bias/variance trade-off, I just realized that the frequentist prescription for binary frequency estimation, s/(s+f), corresponds to a maximum-variance prior rather than a maximum-entropy one. (To be specific, it is an improper prior: the limiting case of the beta distribution with both parameters going to zero. Although this has support everywhere in [0,1], if you try to normalize it, the integral going to infinity kills everything except delta spikes at 0 and 1. But normalizing after conditioning on data gives a reasonable posterior, whose mean is the standard estimate, i.e. the maximum-likelihood estimate.)
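Spelling out that limit (a standard beta-binomial calculation, nothing specific to this comment): with the improper Beta(0, 0) prior and s observed successes, f failures,

$$p(\theta \mid s, f) \;\propto\; \theta^{s}(1-\theta)^{f}\cdot\theta^{-1}(1-\theta)^{-1} \;=\; \theta^{s-1}(1-\theta)^{f-1}, \qquad \text{i.e. } \theta \mid s, f \sim \mathrm{Beta}(s, f),$$

which is proper as soon as s and f are both positive, and whose posterior mean is $s/(s+f)$, the same number the frequentist writes down as the maximum-likelihood estimate.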
Consider the great anti-Bayesian Cosma Shalizi. He’s shown that the use of a prior is really equivalent to a method of smoothing, of regularization on your hypothesis space, trading off (frequentist) bias and variance.
It seems odd to interpret this point as anti-Bayesian. To me it seems pro-Bayesian: it means that whenever you use a regularizer you’re actually doing Bayesian inference. Any method that depends on a regularizer is open to the same critique of subjectivity to which Bayesian methods are vulnerable. Two frequentists using different regularizers will come to different conclusions based on the same evidence, and the choice of a regularizer is hardly inevitable or dictated by the problem.
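The textbook instance of that correspondence (my example, not one Shalizi uses as far as I know): ridge regression’s L2 penalty is a zero-mean Gaussian prior on the weights. With $y \mid X, w \sim \mathcal N(Xw, \sigma^2 I)$ and $w \sim \mathcal N(0, (\sigma^2/\lambda) I)$,

$$\arg\max_w \; p(w \mid y, X) \;=\; \arg\min_w \; \|y - Xw\|^2 + \lambda \|w\|^2,$$

so two frequentists who pick different values of $\lambda$ are, in Bayesian dress, two people who picked different prior variances.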
If you have a link to a paper that contains anti-Bayesian arguments by Shalizi, I would be interested in reading it.

Just Google him—his website is full of tons of interesting stuff.
Well, it seems odd to me too. He has another rant up comparing Bayesian updating to evolution, saying “okay, that’s why Bayesian updating seems to actually work OK in many cases”, whereas I see that as explaining why evolution works...
Cosma’s writings for those interested:

http://cscs.umich.edu/~crshalizi/weblog/cat_bayes.html is a good start. He also has a paper on the arXiv that is flat-out wrong, so ignore “The Backwards Arrow of Time of the Coherently Bayesian Statistical Mechanic”, though showing how it goes wrong takes a fair bit of explaining of fairly subtle points.
He also has a paper on the arXiv that is flat-out wrong, so ignore “The Backwards Arrow of Time of the Coherently Bayesian Statistical Mechanic”, though showing how it goes wrong takes a fair bit of explaining of fairly subtle points.
I’ve tried reading it before—for me to understand just the paper itself would also take a fair bit of explaining of fairly subtle points! I understand Shalizi’s sketch of his argument in words:
Observe your system at time 0, and invoke your favorite way of going from an observation to a distribution over the system’s states—say the maximum entropy principle. This distribution will have some Shannon entropy, which by hypothesis is also the system’s thermodynamic entropy. Assume the system’s dynamics are invertible, so that the state at time t determines the states at times t+1 and t-1. This will be the case if the system obeys the usual laws of classical mechanics, for example. Now let your system evolve forward in time for one time-step. It’s a basic fact about invertible dynamics that they leave Shannon entropy invariant, so it’s still got whatever entropy it had when you started. Now make a new observation. If you update your probability distribution using Bayes’s rule, a basic result in information theory shows that the Shannon entropy of the posterior distribution is, on average, no more than that of the prior distribution. There’s no way an observation can make you more uncertain about the state on average, though particular observations may be very ambiguous. (Noise-free measurements would let us drop the “on average” qualifier.) Repeating this, we see that entropy decreases over time (on average).
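For reference, the “basic result in information theory” being leaned on there is, as I read it, just that conditioning cannot increase entropy on average:

$$H(X \mid Y) \;=\; \sum_y p(y)\, H(X \mid Y = y) \;\le\; H(X),$$

with equality when the observation $Y$ is independent of the state $X$; an individual outcome $y$ can still leave you more uncertain than before, which is the “particular observations may be very ambiguous” caveat.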
My problem is that I know a plausible-looking argument expressed in words can still quite easily be utterly wrong in some subtle way, so I don’t know how much credence to give Shalizi’s argument.
The problem with the quoted argument from Shalizi is that it is describing a decrease in entropy over time of an open system. To track a closed system, you have to include the brain that is making observations and updating its beliefs. Making the observations requires thermodynamic work that can transfer entropy.

D’oh! Why didn’t I think of that?!

If you write such a post, I’ll almost certainly upvote it.
The frequentist approach is analogous to a minimax strategy in a game: no matter how malicious the universe is, you still get your 90%.
This is because you constrained the universe to only being able to present you with sequences of one of two possible values, both of which reveal themselves on first inspection 90% of the time.

Let the universe throw at you a machine that, for all you know, can produce any distribution or pattern of coins with any bias. Try to get your guaranteed 90% when making predictions about the bias of those coins from a single flip.
I didn’t constrain the universe, Wei Dai did. You wanna talk about another problem, fine. I assume you have some secret Bayesian technique for solving it?