I am very grateful for your answer, but I have a few contentions from my paradigm of objective Bayesianism.
You have replaced probability with a physical property: “frequency”. I have also seen other people use terms like bias-weighting, fairness, center of mass, etc., which are all properties of the coin, to sidestep this question. I have nothing against theta being a physical property such that P(heads|theta=alpha) = alpha. In fact, it would make a ton of sense to me if this actually were the case. But the issue is when people say that theta is a probability and treat it as if it were a physical property. I presume you don’t view probabilities to be physical properties. Even subjective Bayesians are not that evil...
“if Jaynes does not have access to the data that formed his prior or cannot explain it well, then what he believes about the coin and what the alien believes about the coin are both ‘rational’, as it is the posterior from their personal priors and the shared data.” If Professor Jaynes did not have access to the data that formed his prior, his prior would have been the same as the alien’s and they would have ended up with the same posterior. There is no such thing as a “personal prior”. I invite you to the light side: read Professor Jaynes’ book; it is absolutely brilliant.
I may be too bad at philosophy to give a satisfying answer, and it may turn out that I actually do not know and am simply too dumb to realize that I should be confused about this :)
There is a frequency of the coin in the real world; let’s say it has θ = 0.5.
Because I am not omniscient, there is a distribution over θ; it’s parameterized by some prior, which we will ignore (let’s not fight about that :)), and some data x. Thus in my head there exists a probability distribution p(θ∣x).
The probability distribution in my head is a distribution, not a scalar: I don’t know what θ is, but I may be 95% certain that it’s between 0.4 and 0.6 (see the small sketch below).
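For concreteness, here is a minimal sketch of such a posterior. The Beta(50, 50) is a hypothetical choice, not anything from our discussion; it is what a uniform Beta(1, 1) prior becomes after 49 heads and 49 tails:

```python
from scipy.stats import beta

# Hypothetical posterior for the coin's frequency: Beta(50, 50).
# Its central 95% credible interval is roughly (0.40, 0.60),
# matching the "95% certain it's between 0.4 and 0.6" statement.
lo, hi = beta.ppf([0.025, 0.975], 50, 50)
print(round(lo, 2), round(hi, 2))  # -> 0.4 0.6
```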
I think there are problems with objective priors, but I am honored to have met an objective Bayesian in the wild, so I would love to try to understand you. I am Jan Christian Refsgaard on the University of Bayes and Bayesian conspiracy discord servers. My main critique is the ‘invariance’ of some priors under some transformations, but that is a very weak critique and my epistemology is very underdeveloped. Also, I just bought Jaynes’ book :) and will read it when I find a study group, so who knows, maybe I will be an objective Bayesian a year from now :)
Response to point one: I do find that to be satisfactory from a philosophical perspective, but only because theta refers to a real-world property called frequency and not the probability of heads. My question to you is this: if you have a point estimate of theta, or if you find the exact real-world value of theta (perhaps by measuring it with an ACME frequency-o-meter), what does it tell you about the probability of heads?
Response to point two: The honour is mine :) If you ever create a study group or discord server for the book, then please count me in
In Bayesian statistics there are two distributions which I think we are conflating here, because they happen to have the same value:
The posterior p(θ∣y) describes our uncertainty of θ, given data (and prior information), so it’s how sure we are of the frequency of the coin
The posterior predictive is our prediction for new coin flips ~y given old coin flips y
p(~y∣y) = ∫_Θ p(~y∣θ, y) p(θ∣y) dθ

For the simple Bernoulli-distribution coin example, the following issue arises: the parameter θ, the posterior predictive, and the posterior all have the same value, but they are different things.
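To see the two objects side by side in the coin case, here is a minimal simulation sketch (the Beta(1, 1) prior, the 100 flips, and the 10,000 draws are hypothetical choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: true frequency theta = 0.5, 100 observed flips.
y = rng.binomial(1, 0.5, size=100)
k, n = y.sum(), len(y)

# Posterior of theta under a Beta(1, 1) prior: Beta(1 + k, 1 + n - k).
theta_post = rng.beta(1 + k, 1 + n - k, size=10_000)

# Posterior predictive of a new flip: integrate theta out by drawing
# one flip per posterior draw of theta.
y_new = rng.binomial(1, theta_post)

# Both means are near 0.5, but theta_post is a distribution over a
# frequency while y_new is a distribution over coin-flip outcomes.
print(theta_post.mean(), y_new.mean())
```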
Here is an example where they are different:
Here θ is not a coin but the logistic intercept of some binary outcome with predictor variable x. Let’s imagine an evil Nazi scientist poisoning people; then we could make a logistic model of y (alive/dead) such as P(y=1) = invlogit(a·x + logit(θ)). Let’s imagine that x is how much poison you ate above/below the average poison level, and that we have θ = 0.5, so on average half died.
Now we have:

The value if we were omniscient:
θ = 0.5

The posterior of θ (because we are not omniscient, there is error):
p(θ∣y) = 0.5 ± ϵ

Predictions for two different ~y with uncertainty:
p(~y_lots of poison ∣ y) = p(~y ∣ y, ~x=2) = 0.99 ± ϵ ≈ 0.99
p(~y_average poison ∣ y) = p(~y ∣ y, ~x=0) = 0.5 ± ϵ ≈ 0.5

Does this help? (There is a small numerical sketch below.)
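Here is that sketch: computing the two posterior predictive probabilities from parameter draws. The posterior draws for a and θ are made up for illustration; in a real analysis they would come from a sampler:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up posterior draws for the two parameters.
theta = rng.normal(0.50, 0.02, size=10_000)  # intercept frequency
a = rng.normal(2.30, 0.10, size=10_000)      # poison slope

def invlogit(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p / (1.0 - p))

# Posterior predictive probability of death at a new poison level x_new:
# average the model's probability over the parameter draws.
for x_new in (2.0, 0.0):
    p_death = invlogit(a * x_new + logit(theta))
    print(x_new, p_death.mean().round(2))  # ~0.99 at x=2, ~0.5 at x=0
```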
I will PM you when we start reading Jaynes. We are currently reading Regression and Other Stories, but in about 20 weeks (when we are done, if we do 1 chapter per week) there is a good chance we will do Jaynes.
To calculate the posterior predictive you need to calculate the posterior, and to calculate the posterior you need to calculate the likelihood (in most problems). For the coin flipping example, what is the probability of heads and what is the probability of tails given that the frequency is equal to some value theta? You might accuse me of being completely devoid of intuition for asking this question, but please bear with me...
Sounds good. I thought nobody was interested in reading Professor Jaynes’ book anymore. It’s a shame more people don’t know about him
Given (1) your model and (2) the magical assumption of no uncertainty in theta, then it’s theta. The posterior predictive allows us to jump from inference about parameters to inference about new data; it’s a distribution of y (coin flip outcomes), not θ (which describes the frequency).
I think I have finally got it. I would like to thank you once again for all your help; I really appreciate it.
This is what I think “estimating the probability” means:
We define theta to be a real-world/objective/physical quantity s.t. P(H|theta=alpha) = alpha & P(T|theta=alpha) = 1 - alpha. We do not talk about the nature of this quantity theta because we do not care what it is. I don’t think it is appropriate to say that theta is “frequency” for this reason:
“frequency” is not a well-defined physical quantity. You can’t measure “frequency” like you measure temperature.
But we do not need to dispute this, as theta being “frequency” is unnecessary.
Using the above definitions, we can compute the likelihood, then the posterior, and then the posterior predictive, which represents the probability of heads in the next flip given data from previous flips.
Is the above accurate?
So Bayesians who say that theta is the probability of heads and compute a point estimate of the parameter theta and say that they have “estimated the probability” are just frequentists in disguise?
I think the above is accurate.
I disagree with the last part, but there are two sources of confusion here:
Frequentist vs Bayesian is in principle about priors, but in practice it is about point estimates vs distributions.
Good frequentists use distributions and bad Bayesians use point estimates such as Bayes factors; a good review of this is https://link.springer.com/article/10.3758/s13423-016-1221-4
But the leap from theta to the probability of heads is, I think, an intuitive leap that happens to be correct but is unjustified.
Philosophically, then, the posterior predictive is actually frequentist; allow me to explain:
Frequentists are people who estimate a parameter and then draw fake samples from that point estimate and summarize them in confidence intervals; to justify this they imagine parallel worlds and whatnot.
Bayesians are people who assume a prior distribution from which the parameter is drawn; they thus have both prior and likelihood uncertainty, which gives posterior uncertainty, which is the uncertainty of the parameters in their model. When a Bayesian wants to use his model to make predictions, he integrates his model parameters out and thus has a predictive distribution of new data given data*. Because this is a distribution of the data, like the frequentist’s sampling function, we can actually draw from it multiple times to compute summary statistics, much like the frequentists, and calculate things such as a “Bayesian p-value”, which describes how likely the model is to have generated our data; here the goal is for the p-value to be non-extreme (close to neither 0 nor 1), because that suggests that the model describes the data well.
*In the real world they do not integrate out theta; they draw it 10,000 times and use those samples as a stand-in distribution, because the math is too hard for complex models.
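In code, such a posterior predictive check might look like this minimal sketch (the Beta(1, 1) prior, the data, and the test statistic are hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observed data: 100 flips with 55 heads.
n, k = 100, 55

# Posterior under a Beta(1, 1) prior is Beta(1 + k, 1 + n - k);
# instead of integrating theta out, draw it 10,000 times.
theta_draws = rng.beta(1 + k, 1 + n - k, size=10_000)

# One replicated data set (summarized by its head count) per
# posterior draw: the posterior predictive distribution.
heads_rep = rng.binomial(n, theta_draws)

# A "Bayesian p-value" for the statistic T(y) = number of heads:
# how often the replications are at least as extreme as the data.
print((heads_rep >= k).mean())  # ~0.5: the model reproduces the data
```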
Excellent! One final point that I would like to add is that if we say “theta is a physical quantity s.t. [...]”, we are faced with an ontological question: “does a physical quantity exist with these properties?”.
I recently found out about Professor Jaynes’ A_p distribution idea (it is introduced in chapter 18 of his book) from Maxwell Peterson in the sub-thread below, and I believe it is an elegant workaround to this problem. It leads to the same results but is more satisfying philosophically.
This is how it would work in the coin flipping example:
Define A(u) to be a function that maps from real numbers to propositions, with domain [0, 1], s.t.
1. The set of propositions {A(u): 0 ≤ u ≤ 1} is mutually exclusive and exhaustive
2. P(y=1 | A(u)) = u and P(y=0 | A(u)) = 1 - u
Because the set of propositions is mutually exclusive and exhaustive, there is one u s.t. A(u) is true and for any v != u, A(v) is false. We call this unique value of u: theta.
It follows that P(y=1 | theta) = theta and P(y=0 | theta) = 1 - theta, and we use this to calculate the posterior predictive distribution.
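Written out, the posterior predictive step this licenses is the usual one (a standard Beta–Bernoulli computation, added here only for concreteness):

```latex
% Posterior predictive for the next flip, with p(u | y) the posterior
% density over which proposition A(u) is the true one:
\begin{align*}
P(y_{\text{new}} = 1 \mid y)
  &= \int_0^1 P(y_{\text{new}} = 1 \mid A(u))\, p(u \mid y)\, du \\
  &= \int_0^1 u\, p(u \mid y)\, du
   = \mathbb{E}[\theta \mid y].
\end{align*}
```

With a Beta(a, b) prior and k heads in n flips, this evaluates to (a + k) / (a + b + n).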
Regarding reading Jaynes, my understanding is it’s good for intuition but bad for applied statistics, because it does not teach you modern Bayesian stuff such as WAIC and HMC, so you should first do one of the applied books. I also think Jaynes has nothing about causality.
I’m afraid I have to disagree. I do sometimes regret not focusing more on applied Bayesian inference. (In fact, I have no idea what WAIC or HMC is.) But in my defence, I am an amateur analytical philosopher & logician, and I couldn’t help finding more non-sequiturs in classical expositions of probability theory than plot-holes in Tolkien novels. Perhaps if I had been more naive and less critical (no offence to anyone) when I read those books, I would have “progressed” faster. I had lost hope in understanding probability theory before I read Professor Jaynes’ book; that’s why I respect the man so much. Now I have the intuition but I am still trying to reconcile it with what I read in the applied literature. I sometimes find it frustrating that I am worrying about the philosophical nuances and intricacies of probability theory while others are applying their (perhaps less coherent) understanding of it to solve problems, but I strongly believe it is worth it :)
I am one of those people with a half-baked epistemology and understanding of probability theory, and I am looking forward to reading Jaynes. And I agree there are a lot of ad-hocisms in probability theory, which means everything is wrong in the logical sense, as some of the assumptions are broken; but a solid modern Bayesian approach has far fewer ad-hocisms and also teaches you to build advanced models in less than 400 pages.
HMC (Hamiltonian Monte Carlo) is a sampling approach to computing the posterior, which in practice is superior to analytical methods because it actually accounts for correlations between predictors and other things which are usually assumed away.
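For intuition, here is a toy sketch of the HMC idea for a one-dimensional target (a standard normal; the step size and trajectory length are arbitrary hand-picked values, whereas real samplers such as Stan tune them automatically):

```python
import numpy as np

def hmc(logp, grad_logp, x0, n_samples=5_000, step=0.1, n_leapfrog=20, seed=0):
    """Toy Hamiltonian Monte Carlo for a 1-D target density."""
    rng = np.random.default_rng(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        p = rng.normal()                      # resample momentum
        x_new, p_new = x, p
        # Leapfrog integration of the Hamiltonian dynamics.
        p_new += 0.5 * step * grad_logp(x_new)
        for _ in range(n_leapfrog - 1):
            x_new += step * p_new
            p_new += step * grad_logp(x_new)
        x_new += step * p_new
        p_new += 0.5 * step * grad_logp(x_new)
        # Metropolis correction on the joint "energy" of (x, p).
        log_alpha = (logp(x_new) - 0.5 * p_new**2) - (logp(x) - 0.5 * p**2)
        if np.log(rng.uniform()) < log_alpha:
            x = x_new
        samples.append(x)
    return np.array(samples)

# Target: standard normal, log p(x) = -x^2/2 up to a constant.
draws = hmc(lambda x: -0.5 * x**2, lambda x: -x, x0=0.0)
print(draws.mean(), draws.std())  # ≈ 0 and ≈ 1
```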
WAIC (the widely applicable information criterion) is information theory on distributions which allows you to say that model A is better than model B because the extra parameters in B are fitting noise; basically minimum description length on steroids for out-of-sample uncertainty.
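A minimal sketch of the standard WAIC computation (this follows the usual lppd-minus-penalty definition from Gelman et al.; the log_lik matrix is assumed to come out of your sampler):

```python
import numpy as np
from scipy.special import logsumexp

def waic(log_lik):
    """WAIC from an S x N matrix of pointwise log-likelihoods:
    one row per posterior draw, one column per observation."""
    S = log_lik.shape[0]
    # lppd: log pointwise predictive density; note the likelihood
    # (not the log-likelihood) is averaged over the posterior draws.
    lppd = np.sum(logsumexp(log_lik, axis=0) - np.log(S))
    # Penalty: per-observation posterior variance of the log-likelihood,
    # the "effective number of parameters" available to fit noise.
    p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))
    return -2.0 * (lppd - p_waic)  # deviance scale: lower is better
```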
Also, I studied biology, which is the worst: I can perform experiments and thus do not have to think about causality, and I do not expect my model to account for half of the signal even if it’s ‘correct’.