I liked this post. All this sounds obvious enough when you read it like that, but it can’t be too obvious when even someone like Matt Parker got it wrong.
A further question: What type of units are probabilities? They seem different from e.g. meters or sheep.
For example, probabilities are bounded below and above (by 0 and 1), while sheep and meters are only bounded below (by 0). (And “million” isn’t bounded at all.) So the expression “twice as many” does make sense for sheep or meters, but arguably not for probabilities, or at least not always: a probability of 0.8 times two would yield a “probability” of 1.6, which is not a probability. It also seems questionable whether doubling the probability 0.01 is “morally the same” as doubling the probability 0.5. Yet in medical testing and treatment trials, ratios of probabilities are often used, e.g. the likelihood ratio P(E|H)/P(E|¬H) for some hypothesis H and some evidence E, or the risk ratio P(H|E)/P(H|¬E). These say that one probability is (for example) “double” the other, while implicitly assuming that doubling a small probability is comparable to doubling a relatively large one. Case in point: doubling the probability of survival doesn’t imply anything about the factor by which the probability of death changes, except that it is between 0 and 1.
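To make that last point concrete (the numbers here are my own, purely for illustration): if P(survival) = 0.4 is doubled to 0.8, then P(death) falls from 0.6 to 0.2, a factor of 1/3; if instead P(survival) = 0.45 is doubled to 0.9, then P(death) falls from 0.55 to 0.1, a factor of 2/11. The same “doubling” of survival corresponds to quite different factors on death.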
Another thing I found interesting is that probabilities can only be added if they are “mutually exclusive”. But that’s actually the same as for meters and sheep. If there are two sheep in the yard and three sheep on the property, they can only be added if “sheep in the yard” and “sheep on the property” are mutually exclusive. Otherwise we would be double counting sheep. And when adding length measurements, we also have to avoid double counting (double measuring).
Moreover, two probabilities can be validly multiplied when they are “independent”. Does this also have an analogy for sheep? I can’t think of one, as multiplying sheep seems generally nonsensical. But multiplying meters does make sense in certain cases: it yields an area if the multiplied lengths were measured at a right angle to each other. I’m not sure whether there is any further connection to probability, but both this and being independent are sometimes called “orthogonal”.
Log odds, measured in something like “bits of evidence” or “decibels of evidence”, is the natural thing to think of yourself as “counting”. A probability of 100% would be like having infinite positive evidence for a claim and a probability of 0% is like having infinite negative evidence for a claim. Arbital has some math and Eliezer has a good old essay on this.
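To get a concrete feel for these units (my own illustration, not from the linked material): a probability of 0.8 corresponds to odds of 4:1 and hence log2(4) = 2 bits of evidence; 0.5 corresponds to odds of 1:1, i.e. 0 bits; and as the probability approaches 1 or 0, the log odds go to +∞ or −∞, matching the “infinite evidence” picture.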
A good general heuristic (or widely applicable hack) to “fix your numbers to even be valid numbers” when deriving probabilities from counts (as in a fast and dirty spreadsheet analysis) is “pseudo-counting”: every category that is analytically possible is treated as having been “observed once in our imaginations”. This prevents naive division on small numbers from ever spitting out 0% or 100% (like seeing 3 out of 3 of something and concluding the probability of that thing is 100%). So if you can fail or succeed, and you’ve seen 3 of one and none of the other, you can use pseudocounts to guesstimate that whatever happened every time so far is (3+1)/(3+2) == 80% likely in the future, and whatever you’ve never seen is (0+1)/(3+2) == 20% likely.
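Here is a minimal Python sketch of that pseudo-counting trick (the function name and interface are my own):

```python
# A minimal sketch of the pseudo-count trick described above: every
# analytically possible outcome starts with one imaginary observation,
# so naive count ratios can never hit exactly 0% or 100%.
from collections import Counter

def pseudo_count_probs(observations, possible_outcomes):
    """Estimate outcome probabilities from raw counts plus one
    imaginary observation per possible outcome."""
    counts = Counter(observations)
    total = len(observations) + len(possible_outcomes)
    return {o: (counts[o] + 1) / total for o in possible_outcomes}

# Three successes, no failures: naive division would claim 100% / 0%.
print(pseudo_count_probs(["success"] * 3, ["success", "failure"]))
# -> {'success': 0.8, 'failure': 0.2}
```

This is just Laplace’s rule of succession applied per category.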
Odds (and log odds) solve some problems but they unfortunately create others.
For addition and multiplication they at least seem to make things worse. We know that we can add probabilities if they are “mutually exclusive” to get the probability of their disjunction, and we know we can multiply them if they are “independent” to get the probability of their conjunction. But when can we add two odds, or multiply two odds (or log odds)? And what would be the interpretation of the result?
On the other hand, unlike for probabilities, multiplication by constants does seem unproblematic for odds (or addition of constants for log odds). E.g. “doubling” some odds always makes sense, since odds are unbounded above, while doubling a probability is not always possible; and when it is, it is questionable whether it has any sensible interpretation.
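For instance (my own numbers): odds of 3:1 correspond to probability 0.75; doubling gives 6:1 (probability ≈ 0.857), and doubling again gives 12:1 (probability ≈ 0.923). No amount of doubling ever leaves the space of valid odds.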
But the way Arbital and Eliezer handle it doesn’t actually make use of this fact. They instead treat the likelihood ratio (or its logarithm) as evidence strength. But, as I said, the likelihood ratio is a ratio of probabilities, not of odds, in which case the interpretation as evidence strength is shaky: it assumes that doubling a small probability of the evidence constitutes the same evidence strength as doubling a relatively large one, which seems wrong.
As a formal example, assume the hypothesis H doubles the probability of evidence E1 compared to ¬H. That is, we have the likelihood ratio P(E1|H)/P(E1|¬H) = 2. Since log2(2) = 1, E1 is interpreted to constitute 1 bit of evidence in favor of H.
Then assume we also have some evidence E2 whose probability is likewise doubled by H compared to ¬H. So E2 is interpreted to also be 1 bit of evidence in favor of H.
Does this mean both cases involve equal evidence strength? Arguably no. For example, the probability of E1 may be quite small while the probability of E2 may be quite large. This would mean H hardly decreases the probability of ¬E1 compared to ¬H, while H strongly decreases the probability of ¬E2 compared to ¬H. So P(¬E1|H)/P(¬E1|¬H) ≫ P(¬E2|H)/P(¬E2|¬H).
So according to the likelihood ratio theory, E1 would be moderate (1 bit) evidence for H, and E2 would be equally moderate evidence for H, but ¬E1 would be very weak evidence against H while ¬E2 would be very strong evidence against H.
That seems implausible. Arguably, E2 here is much stronger evidence for H than E1.
Here is a more concrete example:
H = The patient actually suffers from Diseasitis.
E1 = The patient suffers from Diseasitis according to test 1.
E2 = The patient suffers from Diseasitis according to test 2.
P(E1|H)=0.02
P(E1|¬H)=0.01
P(E2|H)=0.98
P(E2|¬H)=0.49
Log likelihood ratio for E1: log2(P(E1|H)/P(E1|¬H)) = log2(0.02/0.01) = 1 bit
Log likelihood ratio for E2: log2(P(E2|H)/P(E2|¬H)) = log2(0.98/0.49) = 1 bit
So this says both tests represent equally strong evidence.
What if we instead take the ratio of conditional odds, rather than the ratio of conditional probabilities (as in the likelihood ratio)?
O = P / (1 − P)
Log odds ratio for E1: log2(O(E1|H)/O(E1|¬H)) ≈ 1.0146 bits
Log odds ratio for E2: log2(O(E2|H)/O(E2|¬H)) ≈ 5.6724 bits
So the odds ratios are actually pretty different. Unlike the likelihood ratio, the odds ratio agrees with my argument that E2 is significantly stronger evidence than E1.
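For anyone who wants to check the arithmetic, here is a short throwaway Python script (mine, not part of the original argument) re-deriving the four numbers above:

```python
# Recompute the log likelihood ratios and log odds ratios for the
# Diseasitis example: P(E|H) and P(E|¬H) for the two tests.
from math import log2

def odds(p):
    """Convert a probability to odds: O = P / (1 - P)."""
    return p / (1 - p)

tests = {"E1": (0.02, 0.01), "E2": (0.98, 0.49)}

for name, (p_h, p_not_h) in tests.items():
    llr = log2(p_h / p_not_h)              # log likelihood ratio
    lor = log2(odds(p_h) / odds(p_not_h))  # log odds ratio
    print(f"{name}: log likelihood ratio = {llr:.4f} bits, "
          f"log odds ratio = {lor:.4f} bits")

# E1: log likelihood ratio = 1.0000 bits, log odds ratio = 1.0146 bits
# E2: log likelihood ratio = 1.0000 bits, log odds ratio = 5.6724 bits
```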
I was thinking about this a few weeks ago. The answer is that your units are related to the probability measure, and care is needed. Here’s the context:
Let’s say I’m in the standard set-up for linear regression: I have a bunch of input vectors $\{\vec x_i\}_{i=1,\dots,n} \subset \mathbb{R}^k$ and, for some unknown $\vec\mu \in \mathbb{R}^k$ and $\sigma^2 > 0$, the outputs $y_i$ are independent with distributions
$$y_i \sim \vec x_i \cdot \vec\mu + \mathcal{N}(0, \sigma^2)$$
Let $X$ denote the $n \times k$ matrix whose $i$th row is $\vec x_i$, assumed to be full rank. Let $\hat\mu$ denote the random vector corresponding to the fitted estimate of $\vec\mu$ using ordinary least squares linear regression, and let $s^2$ denote the sum of squared residuals. It can be shown geometrically that:
$$\frac{s^2}{\sigma^2} \sim \chi^2_{n-k}, \qquad \frac{\vec\mu - \hat\mu}{\sigma^2} \sim \mathcal{N}\left(\vec 0, (X^t X)^{-1}\right), \qquad \frac{\vec\mu - \hat\mu}{s^2} \sim \frac{\mathcal{N}\left(\vec 0, (X^t X)^{-1}\right)}{\chi^2_{n-k}}$$
(informally, the density of $\frac{\vec\mu - \hat\mu}{s^2}$ is that of the random variable corresponding to sampling a multivariate Gaussian with mean $\vec 0 \in \mathbb{R}^k$ and covariance matrix $(X^t X)^{-1}$, then sampling an independent $\chi^2_{n-k}$ distribution and dividing by the result). A naive undergrad might misinterpret this as meaning that after observing $\vec y$ and computing $\hat\mu, s^2$:
$$\sigma^2 \sim \frac{s^2}{\chi^2_{n-k}}, \qquad \vec\mu \mid \sigma^2 \sim \sigma^2\, \mathcal{N}\left(\hat\mu, (X^t X)^{-1}\right), \qquad \vec\mu \sim \frac{s^2\, \mathcal{N}\left(\hat\mu, (X^t X)^{-1}\right)}{\chi^2_{n-k}}$$
But of course, this can’t be true in general because we did not even mention a prior. But on the other hand, this is exactly the family of conjugate priors/posteriors in Bayesian linear regression… so what possibly-improper prior makes this the posterior?
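As a quick sanity check of the first of the three displayed facts, here is a small numpy simulation (entirely my own sketch, with arbitrary made-up dimensions, not part of the parent comment’s derivation):

```python
# Monte Carlo check that the sum of squared OLS residuals satisfies
# s^2 / sigma^2 ~ chi^2_{n-k}.
import numpy as np

rng = np.random.default_rng(0)
n, k, sigma2 = 50, 3, 4.0
X = rng.normal(size=(n, k))   # rows are the input vectors; full rank w.h.p.
mu = rng.normal(size=k)       # the unknown true coefficient vector

samples = []
for _ in range(10_000):
    y = X @ mu + rng.normal(scale=np.sqrt(sigma2), size=n)
    mu_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS fit
    resid = y - X @ mu_hat
    samples.append(resid @ resid / sigma2)          # s^2 / sigma^2

# A chi^2_{n-k} variable has mean n-k and variance 2(n-k).
print(np.mean(samples), n - k)        # both should be ~47
print(np.var(samples), 2 * (n - k))   # both should be ~94
```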
I won’t spoil the whole thing for you (partly because I’ve accidentally spent too much time writing this comment!) but start with just $\sigma^2$ and $s^2$ and:
1. Calculate the exact posterior density of $\sigma^2$ desired, in terms of $\chi^2_{n-k}$.
2. Use Bayes’ theorem to figure out the prior.
I personally messed up several times on step 2 because I was being extremely naive about the “units” cancelling in Bayes’ theorem. When I finally made it all precise using measures, things actually cancelled properly and I got the correct improper prior distribution on $\sigma^2, \vec\mu$.
(If anyone wants me to finish fleshing out the idea, please let me know).
Thanks for the effort, though unfortunately I’m not familiar with linear regression.