I liked this post. All this sounds obvious enough when you read it like that, but it can’t be too obvious when even someone like Matt Parker got it wrong.
A further question: What type of units are probabilities? They seem different from e.g. meters or sheep.
For example, probabilities are bounded below and above (by 0 and 1), while sheep and meters are only bounded below (by 0). (And “million” isn’t bounded at all.) So the expression “twice as many” does make sense for sheep or meters, but arguably not for probabilities, or at least not always: a probability of 0.8 times two would yield a “probability” of 1.6, which is not a probability. It also seems questionable whether doubling the probability 0.01 is “morally the same” as doubling the probability 0.5. Yet in medical testing and treatment trials, ratios of probabilities are often used, e.g. the likelihood ratio P(E|H)/P(E|¬H) for some hypothesis H and some evidence E, or the risk ratio P(H|E)/P(H|¬E). These say that one probability is (for example) “double” the other, while implicitly assuming that doubling a small probability is comparable to doubling a relatively large one. Case in point: doubling the probability of survival doesn’t imply anything about the factor by which the probability of death changes, except that it is between 0 and 1.
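To make that last point concrete (the numbers here are my own, purely for illustration): if P(survival) = 0.4 is doubled to 0.8, then P(death) falls from 0.6 to 0.2, a factor of 1/3; if instead P(survival) = 0.45 is doubled to 0.9, then P(death) falls from 0.55 to 0.1, a factor of 2/11. The same “doubling” of survival corresponds to quite different factors on death.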
Another thing I found interesting is that probabilities can only be added if they are “mutually exclusive”. But that’s actually the same as for meters and sheep. If there are two sheep in the yard and three sheep on the property, they can only be added if “sheep in the yard” and “sheep on the property” are mutually exclusive. Otherwise we would be double counting sheep. And when adding length measurements, we also have to avoid double counting (double measuring).
Moreover, two probabilities can be validly multiplied when they are “independent”. Does this also have an analogy for sheep? I can’t think of one, as multiplying sheep seems generally nonsensical. But multiplying meters does make sense in certain cases: it yields an area if the multiplied lengths were measured at a right angle to each other. I’m not sure whether there is any further connection to probability, but both this and being independent are sometimes called “orthogonal”.
Log odds, measured in something like “bits of evidence” or “decibels of evidence”, is the natural thing to think of yourself as “counting”. A probability of 100% would be like having infinite positive evidence for a claim and a probability of 0% is like having infinite negative evidence for a claim. Arbital has some math and Eliezer has a good old essay on this.
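To get a concrete feel for these units (my own illustration, not from the linked material): a probability of 0.8 corresponds to odds of 4:1 and hence log2(4) = 2 bits of evidence; 0.5 corresponds to odds of 1:1, i.e. 0 bits; and as the probability approaches 1 or 0, the log odds go to +∞ or −∞, matching the “infinite evidence” picture.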
A good general heuristic (or widely applicable hack) to “fix your numbers to even be valid numbers” when deriving probabilities from counts (as in a fast and dirty spreadsheet analysis) is “pseudo-counting”: every category that is analytically possible is treated as having been “observed once in our imaginations”. This prevents naive division on small numbers from ever spitting out 0% or 100% (like seeing 3 out of 3 of something and concluding the probability of that thing is 100%). So if you can fail or succeed, and you’ve seen 3 of one and none of the other, you can use pseudocounts to guesstimate that whatever happened every time so far is (3+1)/(3+2) == 80% likely in the future, and whatever you’ve never seen is (0+1)/(3+2) == 20% likely.
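Here is a minimal Python sketch of that pseudo-counting trick (the function name and interface are my own):

```python
# A minimal sketch of the pseudo-count trick described above: every
# analytically possible outcome starts with one imaginary observation,
# so naive count ratios can never hit exactly 0% or 100%.
from collections import Counter

def pseudo_count_probs(observations, possible_outcomes):
    """Estimate outcome probabilities from raw counts plus one
    imaginary observation per possible outcome."""
    counts = Counter(observations)
    total = len(observations) + len(possible_outcomes)
    return {o: (counts[o] + 1) / total for o in possible_outcomes}

# Three successes, no failures: naive division would claim 100% / 0%.
print(pseudo_count_probs(["success"] * 3, ["success", "failure"]))
# -> {'success': 0.8, 'failure': 0.2}
```

This is just Laplace’s rule of succession applied per category.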
Odds (and log odds) solve some problems but they unfortunately create others.
For addition and multiplication they at least seem to make things worse. We know that we can add probabilities if they are “mutually exclusive” to get the probability of their disjunction, and we know we can multiply them if they are “independent” to get the probability of their conjunction. But when can we add two odds, or multiply two odds (or log odds)? And what would be the interpretation of the result?
On the other hand, unlike for probabilities, multiplication by constants does seem unproblematic for odds (or addition of constants for log odds). E.g. “doubling” some odds always makes sense, since odds are unbounded above, while doubling a probability is not always possible; and when it is, it is questionable whether it has any sensible interpretation.
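For instance (my own numbers): odds of 3:1 correspond to probability 0.75; doubling gives 6:1 (probability ≈ 0.857), and doubling again gives 12:1 (probability ≈ 0.923). No amount of doubling ever leaves the space of valid odds.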
But the way Arbital and Eliezer handle it doesn’t actually make use of this fact. They instead treat the likelihood ratio (or its logarithm) as evidence strength. But, as I said, the likelihood ratio is a ratio of probabilities, not of odds, in which case the interpretation as evidence strength is shaky: it assumes that doubling a small probability of the evidence constitutes the same evidence strength as doubling a relatively large one, which seems wrong.
As a formal example, assume the hypothesis H doubles the probability of evidence E1 compared to ¬H. That is, we have the likelihood ratio P(E1|H)/P(E1|¬H) = 2. Since log2(2) = 1, E1 is interpreted to constitute 1 bit of evidence in favor of H.
Then assume we also have some evidence E2 whose probability is likewise doubled by H compared to ¬H. So E2 is interpreted to also be 1 bit of evidence in favor of H.
Does this mean both cases involve equal evidence strength? Arguably no. For example, the probability of E1 may be quite small while the probability of E2 may be quite large. This would mean H hardly decreases the probability of ¬E1 compared to ¬H, while H strongly decreases the probability of ¬E2 compared to ¬H. So P(¬E1|H)/P(¬E1|¬H) ≫ P(¬E2|H)/P(¬E2|¬H).
So according to the likelihood ratio theory, E1 would be moderate (1 bit) evidence for H, and E2 would be equally moderate evidence for H, but ¬E1 would be very weak evidence against H while ¬E2 would be very strong evidence against H.
That seems implausible. Arguably, E2 here is much stronger evidence for H than E1.
Here is a more concrete example:
H = The patient actually suffers from Diseasitis.
E1 = The patient suffers from Diseasitis according to test 1.
E2 = The patient suffers from Diseasitis according to test 2.
P(E1|H)=0.02
P(E1|¬H)=0.01
P(E2|H)=0.98
P(E2|¬H)=0.49
Log likelihood ratio for E1: log2(P(E1|H)/P(E1|¬H)) = log2(0.02/0.01) = 1 bit
Log likelihood ratio for E2: log2(P(E2|H)/P(E2|¬H)) = log2(0.98/0.49) = 1 bit
So this says both tests represent equally strong evidence.
What if we instead take the ratio of conditional odds, rather than the ratio of conditional probabilities (as in the likelihood ratio)?
O = P / (1 − P)
Log odds ratio for E1: log2(O(E1|H)/O(E1|¬H)) ≈ 1.0146 bits
Log odds ratio for E2: log2(O(E2|H)/O(E2|¬H)) ≈ 5.6724 bits
So the odds ratios are actually pretty different. Unlike the likelihood ratio, the odds ratio agrees with my argument that E2 is significantly stronger evidence than E1.
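For anyone who wants to check the arithmetic, here is a short throwaway Python script (mine, not part of the original argument) re-deriving the four numbers above:

```python
# Recompute the log likelihood ratios and log odds ratios for the
# Diseasitis example: P(E|H) and P(E|¬H) for the two tests.
from math import log2

def odds(p):
    """Convert a probability to odds: O = P / (1 - P)."""
    return p / (1 - p)

tests = {"E1": (0.02, 0.01), "E2": (0.98, 0.49)}

for name, (p_h, p_not_h) in tests.items():
    llr = log2(p_h / p_not_h)              # log likelihood ratio
    lor = log2(odds(p_h) / odds(p_not_h))  # log odds ratio
    print(f"{name}: log likelihood ratio = {llr:.4f} bits, "
          f"log odds ratio = {lor:.4f} bits")

# E1: log likelihood ratio = 1.0000 bits, log odds ratio = 1.0146 bits
# E2: log likelihood ratio = 1.0000 bits, log odds ratio = 5.6724 bits
```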
I was thinking about this a few weeks ago. The answer is that your units are related to the probability measure, and care is needed. Here’s the context:
Let’s say I’m in the standard set-up for linear regression: I have a bunch of input vectors $\{\vec x_i\}_{i=1,\dots,n} \subset \mathbb{R}^k$ and, for some unknown $\vec\mu \in \mathbb{R}^k$ and $\sigma^2 > 0$, the outputs $y_i$ are independent with distributions
$$y_i \sim \vec x_i \cdot \vec\mu + \mathcal{N}(0, \sigma^2)$$
Let $X$ denote the $n \times k$ matrix whose $i$th row is $\vec x_i$, assumed to be full rank. Let $\hat\mu$ denote the random vector corresponding to the fitted estimate of $\vec\mu$ using ordinary least squares linear regression, and let $s^2$ denote the sum of squared residuals. It can be shown geometrically that:
$$\frac{s^2}{\sigma^2} \sim \chi^2_{n-k}, \qquad \frac{\vec\mu - \hat\mu}{\sigma^2} \sim \mathcal{N}\left(\vec 0, (X^t X)^{-1}\right), \qquad \frac{\vec\mu - \hat\mu}{s^2} \sim \frac{\mathcal{N}\left(\vec 0, (X^t X)^{-1}\right)}{\chi^2_{n-k}}$$
(informally, the density of $\frac{\vec\mu - \hat\mu}{s^2}$ is that of the random variable corresponding to sampling a multivariate Gaussian with mean $\vec 0 \in \mathbb{R}^k$ and covariance matrix $(X^t X)^{-1}$, then sampling an independent $\chi^2_{n-k}$ distribution and dividing by the result). A naive undergrad might misinterpret this as meaning that after observing $\vec y$ and computing $\hat\mu, s^2$:
$$\sigma^2 \sim \frac{s^2}{\chi^2_{n-k}}, \qquad \vec\mu \mid \sigma^2 \sim \sigma^2\, \mathcal{N}\left(\hat\mu, (X^t X)^{-1}\right), \qquad \vec\mu \sim \frac{s^2\, \mathcal{N}\left(\hat\mu, (X^t X)^{-1}\right)}{\chi^2_{n-k}}$$
But of course, this can’t be true in general because we did not even mention a prior. But on the other hand, this is exactly the family of conjugate priors/posteriors in Bayesian linear regression… so what possibly-improper prior makes this the posterior?
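As a quick sanity check of the first of the three displayed facts, here is a small numpy simulation (entirely my own sketch, with arbitrary made-up dimensions, not part of the parent comment’s derivation):

```python
# Monte Carlo check that the sum of squared OLS residuals satisfies
# s^2 / sigma^2 ~ chi^2_{n-k}.
import numpy as np

rng = np.random.default_rng(0)
n, k, sigma2 = 50, 3, 4.0
X = rng.normal(size=(n, k))   # rows are the input vectors; full rank w.h.p.
mu = rng.normal(size=k)       # the unknown true coefficient vector

samples = []
for _ in range(10_000):
    y = X @ mu + rng.normal(scale=np.sqrt(sigma2), size=n)
    mu_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS fit
    resid = y - X @ mu_hat
    samples.append(resid @ resid / sigma2)          # s^2 / sigma^2

# A chi^2_{n-k} variable has mean n-k and variance 2(n-k).
print(np.mean(samples), n - k)        # both should be ~47
print(np.var(samples), 2 * (n - k))   # both should be ~94
```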
I won’t spoil the whole thing for you (partly because I’ve accidentally spent too much time writing this comment!) but start with just $\sigma^2$ and $s^2$ and:
1. Calculate the exact posterior density of $\sigma^2$ desired, in terms of $\chi^2_{n-k}$.
2. Use Bayes’ theorem to figure out the prior.
I personally messed up several times on step 2 because I was being extremely naive about the “units” cancelling in Bayes’ theorem. When I finally made it all precise using measures, things actually cancelled properly and I got the correct improper prior distribution on $\sigma^2, \vec\mu$.
(If anyone wants me to finish fleshing out the idea, please let me know).
Thanks for the effort, though unfortunately I’m not familiar with linear regression.