I only call syllogisms about probabilities valid if they follow from Bayes’ Theorem. You permit yourself a meta-probability distribution over the probabilities and call a syllogism valid if it is Cyan::valid on average w.r.t. your meta-distribution.
But you’re permitting yourself the same thing! Whenever you apply the Bayes Theorem, you’re asserting a probability distribution to hold, even though that might not be the true generating distribution of the phenomenon. You would reject the construction of such a scenario (where your inference is way off) as a “counterexample” or somehow showing the invalidity of updates performed under the Bayes theorem. And why? Because that distribution is the best probability estimate, on average, for scenarios in which you occupy that epistemic state.
All I’m saying is that the same situation holds with respect to undefined tokens. Given that you don’t know what D and H are, and given the two premises, your best estimate of P(H|D) is low. Can you find cases where it isn’t low? Sure, but not on average. Can you find cases where it necessarily isn’t low? Sure, but they involve moving to a different epistemic state.
No, a finite interval is not sufficient. You really need to specify the invariant measure to use maxent in the continuous case.
The uniform distribution on the interval [a,b] is the maximum entropy distribution among all continuous distributions which are supported in the interval [a, b] (which means that the probability density is 0 outside of the interval).
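In symbols, the quoted claim is the standard (Shannon differential-entropy) result: among all densities p supported on [a, b],

\[ -\int_a^b p(x)\,\log p(x)\,dx \]

is maximized by the uniform density p(x) = 1/(b−a), with maximum value log(b−a).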
But you’re permitting yourself the same thing! Whenever you apply the Bayes Theorem...
Checks for a syllogism’s Cyan::validity do not apply Bayes’ Theorem per se. No prior and likelihood need be specified, and no posterior is calculated. The question is “can we start with Bayes’ Theorem as an equation, take whatever the premises assert about the variables in that equation (inequalities or whatever), and derive the conclusion?” Checks for SilasBarta::validity also don’t apply Bayes’ Theorem as far as I can tell—they just involve an extra element (a probability distribution for the variables of the Bayes’ Theorem equation) and an extra operation (expectation w.r.t. the previously mentioned distribution).
You would reject the construction of such a scenario (where your inference is way off) as a “counterexample” or somehow showing the invalidity of updates performed under the Bayes theorem.
This is definitely a point of miscommunication, because I certainly never intended to impeach Bayes’ Theorem.
Given that you don’t know what D and H are, and given the two premises, your best estimate of P(H|D) is low.
Maybe. I’ve yet to be convinced that it’s possible to derive a meta-probability distribution for the unconditional probabilities.
Wrong:
The text you link uses Shannon’s definition of the entropy of a continuous distribution, not Jaynes’s.
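For reference, the two definitions being distinguished here are Shannon’s differential entropy of a density p and Jaynes’s entropy relative to an invariant measure m:

\[ h_{\text{Shannon}}[p] = -\int p(x)\,\log p(x)\,dx, \qquad H_{\text{Jaynes}}[p;m] = -\int p(x)\,\log\frac{p(x)}{m(x)}\,dx. \]

Only the second is invariant under a change of variable, because p and m transform with the same Jacobian.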
But you’re permitting yourself the same thing! Whenever you apply the Bayes Theorem...
Checks for a syllogism’s Cyan::validity do not apply Bayes’ Theorem per se. …
Argh. I wasn’t saying that you were using the Bayes Theorem in your claimed definition of Cyan::validity. I was saying that when you are deriving probabilities through Bayesian inference, you are implicitly applying a standard of validity for probabilistic syllogisms—a standard that matches mine, and yields the conclusion I claimed about the syllogism in question.
This is definitely a point of miscommunication, because I certainly never intended to impeach Bayes’ Theorem.
Yes, definitely a miscommunication: my point there was that the existence of cases where Bayesian inference gives you a probability differing from the true distribution is not evidence for the Bayes Theorem being invalid. I don’t know how you read it before, but that was the point, and I hope it makes more sense now.
Given that you don’t know what D and H are, and given the two premises, your best estimate of P(H|D) is low.
Maybe. I’ve yet to be convinced that it’s possible to derive a meta-probability distribution for the unconditional probabilities.
Why? Because you don’t see how defining the variables is a kind of information you’re not allowed to have here? Because you think you can update (have a non-unity P(D)/P(H) ratio) in the absence of any information about P(D) and P(H)? Because you don’t see how the “member of Congress” case is an example of a low entropy, concentrated-probability-mass case? Because you reject meta-probabilities to begin with (in which case it’s not clear what makes probabilities found through Bayesian inference more “right” or “preferable” to other probabilities, even as they can be wrong)?
The text you link uses Shannon’s definition of the entropy of a continuous distribution, not Jaynes’s.
So? The difference only matters if you want to know the absolute (i.e. scale-invariant) magnitude of the entropy. If you’re only concerned about which distribution has the maximum entropy, you don’t need to pick an invariant measure (at least not for a case as simple as this one), and Shannon and Jaynes give the same result.
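A minimal justification of that last claim, assuming one is entitled to take the invariant measure constant on [a, b], say m(x) = 1/(b−a):

\[ H_{\text{Jaynes}}[p;m] = -\int_a^b p(x)\,\log\bigl(p(x)\,(b-a)\bigr)\,dx = h_{\text{Shannon}}[p] - \log(b-a), \]

so the two entropies differ by a constant and are maximized by the same (uniform) distribution. Whether one is entitled to take m constant is exactly what is contested below.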
when you are deriving probabilities through Bayesian inference, you are implicitly applying a standard of validity for probabilistic syllogisms… that matches mine
I do not agree that that is what I’m doing. I don’t know why my willingness to use Bayes’ Theorem commits me to SilasBarta::validity.
I hope it makes more sense now.
I think I understand what you meant now. I deny that I am permitting myself the same thing as you. I try to make my problems well-structured enough that I have grounds for using a given probability distribution. I remain unconvinced that probabilistic syllogisms not attached to any particular instance have enough structure to justify a probability distribution for their elements—too much is left unspecified. Jaynes makes a related point on page 10 of “The Well-Posed Problem” at the start of section 8.
Why [are you unconvinced]?
Because the only argument you’ve given for it is a maxent one, and it’s not sufficient to the task, as I explain further below.
If you’re only concerned about which distribution has the maximum entropy, you don’t need to pick an invariant measure (at least not for a case as simple as this one), and Shannon and Jaynes give the same result.
This is not correct. The problem is that Shannon’s definition is not invariant to a change of variable. Suppose I have a square whose area is between 1 cm^2 and 4 cm^2. The Shannon-maxent distribution for the square’s area is uniform between 1 cm^2 and 4 cm^2. But such a square has sides whose lengths are between 1 cm and 2 cm. For the “side length” variable, the Shannon-maxent distribution is uniform between 1 cm and 2 cm. Of course, the two Shannon-maxent distributions are mutually inconsistent. This problem doesn’t arise when using the Jaynes definition.
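Worked out explicitly: if the area A is uniform on [1, 4] (density 1/3), then the side length S = √A has density

\[ f_S(s) = f_A(s^2)\,\left|\frac{dA}{dS}\right| = \frac{1}{3}\cdot 2s = \frac{2s}{3}, \qquad 1 \le s \le 2, \]

which is not the uniform density on [1, 2] that a Shannon-maxent argument applied directly to the side length would give.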
In your problem, suppose that, for whatever reason, I prefer the floodle scale to the probability scale, where floodle = prob + sin(2*pi*prob)/(2.1*pi). Why do I not get to apply a Shannon-maxent derivation on the floodle scale?
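To make the floodle example concrete, here is a minimal numerical sketch (function names are just illustrative) of what a flat density on the floodle scale implies on the probability scale:

```python
import numpy as np

# floodle = g(prob): a strictly increasing map of [0, 1] onto [0, 1],
# since g'(p) = 1 + (2/2.1)*cos(2*pi*p) is always positive.
def g(p):
    return p + np.sin(2 * np.pi * p) / (2.1 * np.pi)

def g_prime(p):
    return 1.0 + (2.0 / 2.1) * np.cos(2 * np.pi * p)

# If floodle is taken uniform on [0, 1], the induced density on the
# probability scale is |d floodle / d prob| = g'(prob).
probs = np.linspace(0.0, 1.0, 11)
for p in probs:
    print(f"prob = {p:.1f}   induced density = {g_prime(p):.3f}")

# The density peaks near prob = 0 and prob = 1 (about 1.95) and nearly
# vanishes at prob = 0.5 (about 0.05): the "flat" floodle distribution
# concentrates probability mass at the extremes of the probability scale.
```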
I do not agree that that is what I’m doing. I don’t know why my willingness to use Bayes’ Theorem commits me to SilasBarta::validity.
Because you’re apparently giving the same status (“SilasBarta::validity”) to Bayesian inferences that I’m giving to the disputed syllogism S1. In what sense is it true that Bob is “probably” the murderer, given that you only know he’s been accused, and that his prints were then found on the murder weapon? Okay: in that sense I say that the conclusion of S1 is valid.
Where do you think I’m saying something different?
I deny that I am permitting myself the same thing as you. I try to make my problems well-structured enough that I have grounds for using a given probability distribution. I remain unconvinced that probabilistic syllogisms not attached to any particular instance have enough structure to justify a probability distribution for their elements—too much is left unspecified.
What about the Bayes Theorem itself, which does exactly that (specify a probability distribution on variables not attached to any particular instance)?
In your problem, suppose that, for whatever reason, I prefer the floodle scale to the probability scale, where floodle = prob + sin(2*pi*prob)/(2.1*pi). Why do I not get to apply a Shannon-maxent derivation on the floodle scale?
Because a) your information was given with the probability metric, not the floodle metric, and b) a change in variable can never be informative, while this one allows you to give yourself arbitrary information that you can’t have, by concentrating your probability on an arbitrary hypothesis.
The link I gave specified that the uniform distribution maximizes entropy even for the Jaynes definition.
Because you’re apparently giving the same status (“SilasBarta::validity”) to Bayesian inferences that I’m giving to the disputed syllogism S1.
For me, the necessity of using Bayesian inference follows from Cox’s Theorem, an argument which invokes no meta-probability distribution. Even if Bayesian inference turns out to have SilasBarta::validity, I would not justify it on those grounds.
What about the Bayes Theorem itself, which does exactly that (specify a probability distribution on variables not attached to any particular instance)?
I wouldn’t say that Bayes’ Theorem specifies a probability distribution on variables not attached to any particular instance; rather it uses consistency with classical logic to eliminate a degree of freedom in how other methods can specify otherwise arbitrary probability distributions. That is, once I’ve somehow picked a prior and a likelihood, Bayes’ Theorem shows how consistency with logic forces my posterior distribution to be proportional to the product of those two factors.
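In symbols:

\[ P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D)} \;\propto\; P(D \mid H)\,P(H). \]

The theorem constrains the posterior once a prior P(H) and likelihood P(D | H) have been supplied by some other means; it does not itself supply them.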
Because a) your information was given with the probability metric, not the floodle metric, and b) a change in variable can never be informative, while this one allows you to give yourself arbitrary information that you can’t have, by concentrating your probability on an arbitrary hypothesis.
I’m going to let this pass because it is predicated on what I believe to be a confusion about the significance of using Shannon entropy instead of Jaynes’s version.
The link I gave specified that the uniform distribution maximizes entropy even for the Jaynes definition.
We’re at the “is not! / is too!” stage in our dialogue, so absent something novel to the conversation, this will be my final reply on this point.
The link does not so specify: this old revision shows that the example refers specifically to the Shannon definition. I believe the more general Jaynes definition was added later in the usual Wikipedia mishmash fashion, without regard to the examples listed in the article.
In any event, at this point I can only direct you to the literature I regard as definitive: section 12.3 of PT:LOS (pp 374-8) (ETA: Added link—Google Books is my friend). (The math in the Wikipedia article Principle of maximum entropy follows Jaynes’s material closely. I ought to know: I wrote the bulk of it years ago.) Here’s some relevant text from that section:
The conclusions, evidently, will depend on which [invariant] measure we adopt. This is the shortcoming from which the maximum entropy principle has suffered until now, and which must be cleared up before we can regard it as a full solution to the prior probability problem.
Let us note the intuitive meaning of this measure. Consider the one-dimensional case, and suppose it is known that a < x < b but we have no other prior information. Then… [e]xcept for a constant factor, the measure m(x) is also the prior distribution describing ‘complete ignorance’ of x. The ambiguity is, therefore, just the ancient one which has always plagued Bayesian statistics: how do we find the prior representing ‘complete ignorance’? Once this problem is solved [emphasis added], the maximum entropy principle will lead to a definite, parameter-independent method of setting up prior distributions based on any testable prior information.
Y… you mean you were citing as evidence a Wikipedia article you had heavily edited? Bad Cyan! ;-)
Okay, I agree we’re at a standstill. I look forward to comments you may have after I finish the article I mentioned. FWIW, the article isn’t about this specific point I’ve been defending, but rather, about the Bayesian interpretation of standard fallacy lists, where my position here falls out as a (debatable) implication.
Requesting explanation for the downvote of the parent.