Consider reading the paper.
From skimming the paper, it looks like the issue is how they’re defining closeness of models. They consider a fair coin and a coin that lands on heads 51% of the time to be close, even though the prior probability of a billion consecutive heads is very different under each of those models. I would consider those two models to be distant, perhaps infinitely so. One assumes that the coin is fair, and the other assumes that it is not. Close models would give similar probabilities to fair coins.
Not sure why you’re being downvoted; the metric used to define “similar” or “closeness” is absolutely what’s at issue here. Their choice of metric doesn’t care very much about falsely assigning a probability of zero to a hypothesis, and Bayesian inference does care very much about whether you falsely assign a probability of zero to a hypothesis.
That being said, I won’t consider this a complete rebuttal until I see someone listing metrics under which Bayesian inference is well-posed and we can see if any of them are useful. Energy distance is a nice one for practical reasons, for example; does it also play well with Bayes?
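On the zero-probability point: a prior of exactly zero can never be updated away, no matter how strong the evidence for the hypothesis is. A two-line Bayes calculation shows it (a generic sketch with made-up numbers, not anything from the paper):

```python
# A hypothesis assigned prior probability zero can never recover,
# no matter how strongly the evidence favours it.
def posterior(prior, likelihood_h, likelihood_not_h):
    """Bayes' rule for a binary hypothesis H vs not-H."""
    num = prior * likelihood_h
    return num / (num + (1 - prior) * likelihood_not_h)

# A tiny but nonzero prior lets strong evidence dominate:
print(posterior(1e-9, 0.99, 1e-12))   # close to 1
# A prior of exactly zero leaves the posterior stuck at zero:
print(posterior(0.0, 0.99, 1e-12))    # exactly 0.0
```

So any notion of "closeness" that treats a zero-prior model as near a nonzero-prior one is glossing over an unbridgeable gap in what updates can ever do.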
Any metric whereby a 51% coin isn’t close to a fair coin is useless in practice.
I don’t understand you. Neither “a 51% coin” nor “a fair coin” is a probability distribution, and the metric in question is a metric on spaces of probability distributions. Could you clarify?
Although, I could take your statement at face value, too. Want to make a few million $1 bets with me? We’ll either be using “rand < .5” or “rand < .51” to decide when I win; since trying to distinguish between the two is useless, you don’t need to bother.
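To put a number on it, here is a quick simulation of those bets (illustrative only; the seed and bet count are arbitrary):

```python
import random

random.seed(0)
N = 1_000_000  # a million $1 bets

fair_wins   = sum(random.random() < 0.50 for _ in range(N))
biased_wins = sum(random.random() < 0.51 for _ in range(N))

# The expected gap is N * 0.01 = 10,000 wins, while each count's
# standard deviation is only about sqrt(N)/2 = 500, so the gap is
# enormous compared to the noise: the coins are trivially
# distinguishable at this scale.
print(biased_wins - fair_wins)
```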
Of course they are, they represent Bernoulli distributions.
You could call them Bernoulli distributions representing aleatory uncertainty on a single coin flip, I suppose. Bayesian updates of purely aleatory uncertainty aren’t very interesting, though, are they? Your evidence is “I looked at it, it’s heads”, and your posterior is “It was heads that time”.
I suppose you could add some uncertainty to the evidence; maybe we’re looking at the coin flip through a blurry telescope? But in any case, Bernoulli distributions live in a finite-dimensional space of probability distributions, so Bayesian updates on them are still well-posed. The concern here is that infinite-dimensional spaces of probability distributions don’t always lead to well-posed Bayesian updates, depending on what metric you use to define well-posedness. If the claim is that this can also happen with Bernoulli distributions, then I’d like to see an example; if not, then the coin example is a red herring.
I also don’t understand the downvote. Is there a single sentence in the above post that’s mistaken? If so then a correction would be appreciated.
Also, once you are not limited to a single flip and can flip the coin multiple times, you graduate to binomial distributions, which are highly useful and for which Bayesian updates are sufficiently interesting :-)
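A minimal sketch of such an update, using the standard conjugate Beta prior on the heads probability (the flip counts are my own made-up numbers):

```python
# Beta(a, b) prior on the heads probability; observing k heads in n
# flips gives the conjugate posterior Beta(a + k, b + n - k).
def update(a, b, k, n):
    return a + k, b + n - k

a, b = 1, 1                      # Beta(1, 1): uniform prior on [0, 1]
a, b = update(a, b, 530, 1000)   # hypothetical: 530 heads in 1000 flips
posterior_mean = a / (a + b)     # (1 + 530) / (2 + 1000), about 0.53
print(a, b, posterior_mean)
```

Because the parameter space is one-dimensional, this update is trivially well-posed: nearby priors (in the Beta parameters) give nearby posteriors.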
The maximum of the absolute value of the log of the ratio between the probabilities the two priors assign to a given hypothesis. That is the log of the strongest evidence ratio it could take to bring you from one prior to the other.
I’m unclear on your terminology. I take a prior to be a distribution over distributions; in practice, usually a distribution over the parameters of a parameterised family. Let P1 and P2 be two priors of this sort, distributions over some parameter space Q. Write P1(q) for the probability density at q, and P1(x|q) for the probability density at x for parameter q. x varies over the data space X.
Is the distance measure you are proposing max_{q in Q} abs log( P1(q) / P2(q) )?
Or is it max_{q in Q,x in X} abs log( P1(x|q) / P2(x|q) )?
Or max_{q in Q,x in X} abs log( (P1(q)P1(x|q)) / (P2(q)P2(x|q)) )?
Or something else?
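To make the candidates concrete, here is a toy two-hypothesis, two-outcome example computing all three (every number here is mine, purely illustrative):

```python
from math import log

Q = ["fair", "biased"]          # toy parameter space
X = ["H", "T"]                  # toy data space

# Hypothetical priors over Q:
P1_q = {"fair": 0.9, "biased": 0.1}
P2_q = {"fair": 0.5, "biased": 0.5}

# Hypothetical likelihood families P1(x|q) and P2(x|q):
lik1 = {"fair": {"H": 0.5, "T": 0.5}, "biased": {"H": 0.51, "T": 0.49}}
lik2 = {"fair": {"H": 0.5, "T": 0.5}, "biased": {"H": 0.6,  "T": 0.4}}

# Candidate 1: max over q of |log(P1(q) / P2(q))|
d1 = max(abs(log(P1_q[q] / P2_q[q])) for q in Q)
# Candidate 2: max over q, x of |log(P1(x|q) / P2(x|q))|
d2 = max(abs(log(lik1[q][x] / lik2[q][x])) for q in Q for x in X)
# Candidate 3: max over q, x of |log of the ratio of the joints|
d3 = max(abs(log(P1_q[q] * lik1[q][x] / (P2_q[q] * lik2[q][x])))
         for q in Q for x in X)
print(d1, d2, d3)
```

The three genuinely come apart: here the prior-only distance is log 5, the likelihood-only distance is much smaller, and the joint distance is the largest of the three.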
A distribution over distributions just becomes a distribution. Just use P(x) = integral_{q in Q} P(x|q) P(q) dq. The distance I’m proposing is max_x abs log(P1(x) / P2(x)) = max_x abs( log(integral_{q in Q} P1(x|q) P1(q) dq) - log(integral_{q in Q} P2(x|q) P2(q) dq) ).
I think it might be possible to make this better. If Alice and Bob both agree that x is unlikely, then their disagreeing about its probability seems like less of a problem. For example, if Alice thinks it’s one-in-a-million and Bob thinks it’s one-in-a-billion, then Alice would need a thousand-to-one evidence ratio to come around to Bob’s belief; that piece of evidence has about a one-in-a-thousand chance of occurring, but since it only has a one-in-a-million chance of being needed, that doesn’t matter much. It seems like it would only make a one-in-a-thousand difference. Done this way the distance would need to be additive, but it is still at most the metric I just gave.
The metric for this would be:
integral_x log( max(P1(x), P2(x)) * max(P1(x) / P2(x), P2(x) / P1(x)) ) dx
= integral_x log( max(P1(x)^2 / P2(x), P2(x)^2 / P1(x)) ) dx
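For what it’s worth, the marginalise-then-max-log-ratio distance from above is easy to compute on a toy discrete example (all numbers mine, purely illustrative):

```python
from math import log

X = ["H", "T"]
Q = ["fair", "biased"]
# A shared, hypothetical likelihood family P(x|q):
lik = {"fair": {"H": 0.5, "T": 0.5}, "biased": {"H": 0.51, "T": 0.49}}

def marginal(prior):
    """Collapse a prior over Q into a single distribution over X."""
    return {x: sum(lik[q][x] * prior[q] for q in Q) for x in X}

P1 = marginal({"fair": 0.9, "biased": 0.1})
P2 = marginal({"fair": 0.5, "biased": 0.5})

# max_x |log(P1(x) / P2(x))|: the log of the largest evidence ratio
# a single observation can carry between the two marginals.
d = max(abs(log(P1[x] / P2[x])) for x in X)
print(d)
```

Note that this distance becomes infinite as soon as one marginal assigns probability zero to a point the other doesn’t, which is exactly the zero-probability concern raised earlier in the thread.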
I think energy distance doesn’t work: the “notched” distributions that their work uses lie close to the original distribution in that distance, as they do for total variation and Prokhorov distance. I am guessing that Kullback-Leibler doesn’t work either, provided the notches don’t go all the way to zero. You just make the notch low enough to get a high probability for the desired posterior, then make it narrow enough to reduce KL divergence as low as you want.
If it is assumed that the observations are only made to finite precision (e.g. each observation takes the form of a probability distribution with entropy bounded from below), it’s not clear to me what happens to their results. Their examples depend on being able to narrow the notch arbitrarily while still containing the observed data with certainty. That can’t be done if the data are only known to bounded precision.
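The narrowing argument above is easy to check numerically. A sketch with a discretised uniform distribution and a hypothetical notch (my construction, not the paper’s): fixing the notch depth and shrinking its width sends the KL divergence to zero, while the probability inside the notch stays suppressed by the same huge factor.

```python
from math import log

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def notched(n_bins, notch_bins, depth):
    """Uniform over n_bins, with the first notch_bins scaled down
    by `depth` and the whole thing renormalised."""
    raw = [depth if i < notch_bins else 1.0 for i in range(n_bins)]
    total = sum(raw)
    return [r / total for r in raw]

uniform = [1.0 / 1000] * 1000

# Fixed notch depth of 1e-6; narrowing the notch drives the KL
# divergence toward zero, yet a data point inside the notch is
# still about a million times less likely under the notched model.
for width in (100, 10, 1):
    print(width, kl(uniform, notched(1000, width, 1e-6)))
```

So KL divergence shares the failure mode, as long as the notch stays strictly above zero.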
No coin is 100% fair.