It wasn’t clear to me how that misses the point of the paper, and in acknowledgment of that possibility I added the caveat at the end. Hardly “obnoxious”.
Nevertheless, your original comment would be a lot more helpful if you actually summarized the point of the paper well enough that I could tell that my comment is irrelevant.
Could you edit your original post to do so? (Please don’t tell me it’s impossible. If you do, I’ll have to read the paper myself, post a summary, save everyone a lot of time, and prove you wrong.)
The point of the paper is that the reasoning behind the p-value approach to null hypothesis rejection ignores a critical factor, to wit, the ratio of the prior probability of the hypothesis to that of the data. Your s/member of Congress/Russian example shows that sometimes that factor is close enough to unity that it can be ignored, but that’s not the fallacy. The fallacy is failing to account for it at all.
On second thought, my original reasoning was correct, and I should have spelled it out. I’ll do so here.
It’s true that the ratio influences the result, but just the same, you can use your probability distribution of what predicates will appear in the “member of Congress” slot, over all possible propositions. It’s hard to derive, but you can come up with a number.
See, for example, Bertrand’s paradox: the question of how likely a randomly chosen chord of a circle is to be longer than the side of an inscribed equilateral triangle. Some say the answer depends on how you randomly choose the chord. But as E. T. Jaynes argued, the problem is well-posed as is. You just strip away any false assumptions you have about how the chord is chosen, and use the max-entropy probability distribution subject to whatever constraints are left.
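For concreteness, here is a minimal Monte Carlo sketch (my own illustration, using the three textbook chord constructions with their usual parameterizations) of why the problem looks ambiguous: the three sampling methods give roughly 1/3, 1/2, and 1/4.

```python
import math
import random

R = 1.0
SIDE = math.sqrt(3) * R  # side length of the inscribed equilateral triangle
N = 100_000

def chord_from_endpoints():
    # Two independent uniform points on the circle.
    a, b = random.uniform(0, 2 * math.pi), random.uniform(0, 2 * math.pi)
    return 2 * R * abs(math.sin((a - b) / 2))

def chord_from_radius():
    # Perpendicular to a radius, at a uniform distance from the center.
    d = random.uniform(0, R)
    return 2 * math.sqrt(R**2 - d**2)

def chord_from_midpoint():
    # Midpoint uniform over the disk.
    r = R * math.sqrt(random.random())
    return 2 * math.sqrt(R**2 - r**2)

for name, draw in [("endpoints", chord_from_endpoints),
                   ("radius", chord_from_radius),
                   ("midpoint", chord_from_midpoint)]:
    p = sum(draw() > SIDE for _ in range(N)) / N
    print(f"{name}: P(chord longer than side) ~ {p:.3f}")
# Prints roughly 0.333, 0.500, 0.250 respectively.
```

The simulation only restates the ambiguity; Jaynes’s point is that requiring the answer to be invariant under rotation, translation, and scaling of the circle singles out one of these distributions (the “random radius” one, giving 1/2) as the uniquely ignorant choice.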
Likewise, you can assume you’re being given a random syllogism of this form, weighted over the probabilities of X and Y appearing in those slots
If a person is an X, then he is probably not a Y.
This person is a Y.
Therefore, he is probably not an X.
It wasn’t: when a certain form of argument is asserted to be valid, it suffices to demonstrate a single counterexample to falsify the assertion. It’s kind of funny—you wrote
Valid reasoning. The problem lies in the failure to include all relevant knowledge [].
But the failure to include all relevant knowledge is exactly why the reasoning isn’t valid.
It wasn’t: when a certain form of argument is asserted to be valid, it suffices to demonstrate a single counterexample to falsify the assertion.
Not for probabilistic claims.
It’s kind of funny—you wrote
Valid reasoning. The problem lies in the failure to include all relevant knowledge [].
But the failure to include all relevant knowledge is exactly why the reasoning isn’t valid.
No. The reasoning can be valid even though, given additional information, the conclusion would be changed.
Example:
Bob is accused of murder. Then, Bob’s fingerprints are the only ones found on the murder weapon. Bob has an ironclad alibi: 30 witnesses and video footage of where he was.
O(guilty|accused of murder) = 1:3
P(prints on weapon|guilty) / P(prints on weapon|~guilty) = 1000
O(guilty|accused of murder, prints on weapon) = 1000*(1:3) = 1000:3
P(guilty| ….) > 99%.
If Bob is accused of murder, he has a moderate chance of being guilty.
Bob’s prints are much more likely to later be the only ones found on the murder weapon if he were guilty than if he were not.
Bob’s prints are the only ones on the murder weapon.
Therefore, there is a very high probability Bob is guilty.
Bob probably isn’t guilty.
Therefore the Bayes Theorem is invalid reasoning. (???)
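For the record, here is the arithmetic spelled out, using only the odds and likelihood ratio stipulated above:

```python
prior_odds = 1 / 3          # O(guilty | accused of murder) = 1:3, as stipulated
likelihood_ratio = 1000     # P(prints | guilty) / P(prints | ~guilty), as stipulated

posterior_odds = likelihood_ratio * prior_odds           # 1000:3
posterior_prob = posterior_odds / (1 + posterior_odds)   # convert odds to probability
print(posterior_prob)                                     # ~0.997, i.e. > 99%
```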
See the problem? The form of the reasoning presented originally is valid. That is what I was defending. But obviously, you can show the conclusion is invalid if you include additional information. In the general case, reasoning that
If a person is an X, then he is probably not a Y.
This person is a Y.
Therefore, he is probably not an X.
is valid, if that is all you know. But you can only invert the conclusion by assuming a higher level of knowledge than what is presented (in the quoted model above) -- specifically, that you have an additional low-entropy point in your probability distribution for “Y implies high probability of X”. But again, this assumes a probability distribution of lower entropy (higher informativeness) than you can justifiably claim to have.
So you can actually form a valid probabilistic inference without looking up the specific p(H)/p(E) ratio applying to this specific situation—just use your max entropy distribution for those values, which favors the reasoning I was defending.
I’m actually writing up an article for LW about the “Fallacy Fallacy” that touches on these issues—I think it would be worthwhile to finish it and post it. (So no, I’m not just arguing this point to save face—there’s an important lesson here that ties into the Bertrand Paradox and Jaynes’s work.)
Not really. You keep demonstrating my point as if it supports your argument, so I know we’ve got a major communication problem.
The form of the reasoning presented originally is valid. That is what I was defending.
And that’s what I’m attacking. We are using the same definition of “valid”, right? An argument is valid if and only if the conclusion follows from the premises. You’re missing the “only if” part.
It wasn’t: when a certain form of argument is asserted to be valid, it suffices to demonstrate a single counterexample to falsify the assertion.
Not for probabilistic claims.
Yes, even for probabilistic claims. See Jaynes’s policeman’s syllogism in Chapter 1 of PT:LOS for an example of a valid probabilistic argument. You can make a bunch of similarly formed probabilistic syllogisms and check them against Bayes’ Theorem to see if they’re valid. The syllogism you’re attempting to defend is
P(D|H) has a low value.
D is true.
Therefore, P(H|D) has a low value.
But this doesn’t follow from Bayes’ Theorem at all, and the Congress example is an explicit counterexample.
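To make the counterexample concrete with rough numbers (my own illustrative figures, not taken from the paper): take H = “this person is an American” and D = “this person is a member of Congress”.

```python
# Rough, illustrative figures (assumptions for this sketch only).
us_population = 330_000_000
members_of_congress = 535

p_d_given_h = members_of_congress / us_population  # ~1.6e-6: premise 1 holds, P(D|H) is tiny
p_h_given_d = 1.0                                  # essentially every member of Congress is American

print(p_d_given_h, p_h_given_d)  # a low P(D|H) does not force a low P(H|D)
```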
So you can actually form a valid probabilistic inference without looking up the specific p(H)/p(E) ratio applying to this specific situation—just use your max entropy distribution for those values, which favors the reasoning I was defending.
Once you know the specific H and E involved, you have to use that knowledge; whatever probability distribution you want to postulate over p(H)/p(E) is irrelevant. But even ignoring this, the idea is going to need more development before you put it into a post: Jaynes’s argument in the Bertrand problem postulates specific invariances and you’ve failed to do likewise; and as he discusses, the fact that his invariances are mutually compatible and specify a single distribution instead of a family of distributions is a happy circumstance that may or may not hold in other problems. The same sort of thing happens in maxent derivations (in continuous spaces, anyway): the constraints under which entropy is being maximized may be overspecified (mutually inconsistent) or underspecified (not sufficient to generate a normalizable distribution).
Okay, let me first try to clarify where I believe the disagreement is. If you choose to respond, please let me know which claims of mine you disagree with, and where I mischaracterize your claims.
I claim that the following syllogism S1 is valid in that it reaches a conclusion that is, on average, correct.
P(D|H) has a low value.
D is true.
Therefore, P(H|D) has a low value.
So, I claim, if you know nothing about what H and D are, except that the first two lines hold, your best bet (expected circumstance over all possibilities) is that the third line holds as well. You claim that the syllogism is invalid because this syllogism, S2, is invalid:
P(D|H) has a low value.
D is true.
P(H|D) has a high value.
Therefore, P(H|D) has a low value.
I claim your argument is mistaken, because the invalidity of S2 does not imply the invalidity of S1; it’s using different premises.
(You further claim that the existence of a case where P(H|D) has a high value despite lines 1 and 2 of S1 holding, is proof that S1 is invalid. I claim that its probabilistic nature means that it doesn’t have to get the right answer (that further knowledge reveals) every time, giving a long example about murder.)
I claim that the article cited by Vladimir was claiming that S1 is an invalid syllogism. I claim that it is in error to do so, and that it was actually showing the errors that result from failing to incorporate all knowledge. So, it is not the use of the template S1 that is the problem, but failing to recognize that your template is actually S2, since your knowledge about members of Congress adds line 3 in S2.
I further claim that S1 is justified by maximum entropy inference, and that the parallels to Bertrand’s paradox were clear. I take back the latter part, and will now attempt to show why similar reasoning and invariances apply here.
Given line 1, you know that, whatever the probability distribution of D, it intersects at least a small fraction of H. So draw the Venn/Euler diagram: the D circle (well, a general bounded curve, but we’ll call it a circle) could encompass only that small portion of H (in the member of Congress case). Or it could encompass that, and some area outside H. At the other extreme, it could encompass all of ~H. Averaging over all these possibilities, there is only a small (meta)chance that your D circle just happens to be at or very near the low end of the possibilities.
In terms of Bayes’s theorem: P(H|D) = P(D|H)*P(H)/P(D). You know P(D|H) is low. Now here’s the problem: you claim you must account for P(H)/P(D). However, under maximum entropy assumptions, if all you know is lines 1 and 2, you have a very “flat” probability distribution. As you probably agree, you cannot justify, at this point, the belief that P(H) is much greater than P(D), nor that it is much less. Rather, you must smear your (meta)probability distribution on P(H) and P(D) across the range from 0 to 1. This gives an expected ratio of 1, which indeed corresponds to zero knowledge. (And, not surprisingly, the informativeness of a piece of evidence is often characterized by the absolute value of the log of the Bayes factor: the more informative, the more the ratio log-deviates from 1.)
Since your minimum knowledge assumption puts P(H)/P(D) at 1, then a small P(D|H) implies a small P(H|D). Yes, additional knowledge can overturn this. But on average, a low P(H|D) follows from applying all the knowledge you have, and none that you don’t.
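Here is one way to turn the “on average” claim into a simulation. It is only a sketch under one particular parameterization that I am assuming for illustration (P(H) and P(D|~H) drawn uniformly and independently); whether that parameterization is itself justified is part of what’s in dispute.

```python
import random

q = 0.01        # the stipulated low value of P(D|H)
N = 100_000
low_count = 0

for _ in range(N):
    p_h = random.random()         # P(H) ~ Uniform(0, 1)    (an assumption)
    p_d_not_h = random.random()   # P(D|~H) ~ Uniform(0, 1) (an assumption)
    p_d = q * p_h + p_d_not_h * (1 - p_h)              # law of total probability
    p_h_given_d = q * p_h / p_d if p_d > 0 else 0.0    # Bayes' theorem
    low_count += p_h_given_d < 0.1

print(low_count / N)  # fraction of draws in which P(H|D) also comes out low
```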
So, are we saying the same thing in different ways, or what? I suspect some of the confusion comes from gauging the full implications of knowing nothing about the claims H and D except for line 1 and 2.
We’re using different definitions of validity. Yours is “[a] syllogism… is valid [if] it reaches a conclusion that is, on average, correct.” Mine is this one.
ETA: Thank you for taking the time to explain your position thoroughly; I’ve upvoted the parent. I’m unconvinced by your maximum entropy argument because, at the level of lack of information you’re talking about, H and D could be in continuous spaces, and in such spaces, maximum entropy only works relative to some pure non-informative measure, which has to be derived from arguments other than maximum entropy.
We’re using different definitions of validity. Yours is “[a] syllogism… is valid [if] it reaches a conclusion that is, on average, correct.” Mine is this one.
Okay, then how do you reply to my point about Bayesian reasoning in general? All Bayesian inference does is tell you what probability distribution you are justified in having, given your current level of knowledge.
With additional knowledge, that probability distribution changes. That doesn’t make your original probability assignments wrong. It doesn’t invalidate the probabilistic syllogisms you made using Bayes’s Theorem. So it seems like your definition of validity in probabilistic syllogisms matches mine.
Again, refer back to the murder example. The fact that the alibi reverses the probability of guilt resulting from the fingerprint evidence, does not mean it was invalid to assign a high probability of guilt when you only had the fingerprint evidence.
“But the alibi is additional evidence!” Yes, but so is knowledge of what H and D stand for.
I’m unconvinced by your maximum entropy argument because, at the level of lack of information you’re talking about, H and D could be in continuous spaces,
A continuous space, yes, but on a finite interval. That lets you define the max-entropy (meta)probability distribution. If q equals P(D|H) (which is low), then your (meta)distribution on P(D) is a flat line over the interval [q,1]. Most of that distribution is such that P(H|D) is also low.
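One way to spell out the “most of that distribution” claim (my own arithmetic, with illustrative numbers): since P(H) ≤ 1, P(H|D) = P(D|H)*P(H)/P(D) ≤ q/P(D), so P(H|D) is guaranteed to fall below a threshold eps whenever P(D) > q/eps. Under a flat P(D) on [q, 1], that covers most of the interval.

```python
q, eps = 0.01, 0.1   # illustrative values: P(D|H) = q, "low" means below eps
# Fraction of a uniform P(D) over [q, 1] for which P(D) > q/eps,
# i.e. for which P(H|D) <= q/P(D) is guaranteed to be below eps.
fraction = (1 - q / eps) / (1 - q)
print(fraction)  # ~0.909
```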
I appreciate the civility with which you’ve approached this disagreement.
So it seems like your definition of validity in probabilistic syllogisms matches mine.
I only call syllogisms about probabilities valid if they follow from Bayes’ Theorem. You permit yourself a meta-probability distribution over the probabilities and call a syllogism valid if it is Cyan::valid on average w.r.t. your meta-distribution. I’m not saying that SilasBarta::valid isn’t a possibly interesting thing to think about, but it doesn’t seem to match Cyan::valid to me.
A continuous space, yes, but on a finite interval. That lets you define the max-entropy (meta)probability distribution.
No, a finite interval is not sufficient. You really need to specify the invariant measure to use maxent in the continuous case. For instance, suppose we had a straw-throwing machine, a spinner-controlling machine, and a dart-throwing machine, each to be used to draw a chord on a circle (extending the physical experiments described here). We have testable information about each of their accuracies and precisions. According to my understanding of Jaynes, when maximizing entropy we need to use different invariant measures for the three different machines, even though the (finite) outcome space is the same in all cases.
I only call syllogisms about probabilities valid if they follow from Bayes’ Theorem. You permit yourself a meta-probability distribution over the probabilities and call a syllogism valid if it is Cyan::valid on average w.r.t. your meta-distribution.
But you’re permitting yourself the same thing! Whenever you apply the Bayes Theorem, you’re asserting a probability distribution to hold, even though that might not be the true generating distribution of the phenomenon. You would reject the construction of such a scenario (where your inference is way off) as a “counterexample” or somehow showing the invalidity of updates performed under the Bayes theorem. And why? Because that distribution is the best probability estimate, on average, for scenarios in which you occupy that epistemic state.
All I’m saying is that the same situation holds with respect to undefined tokens. Given that you don’t know what D and H are, and given the two premises, your best estimate of P(H|D) is low. Can you find cases where it isn’t low? Sure, but not on average. Can you find cases where it necessarily isn’t low? Sure, but they involve moving to a different epistemic state.
No, a finite interval is not sufficient. You really need to specify the invariant measure to use maxent in the continuous case
The uniform distribution on the interval [a,b] is the maximum entropy distribution among all continuous distributions which are supported in the interval [a, b] (which means that the probability density is 0 outside of the interval).
But you’re permitting yourself the same thing! Whenever you apply the Bayes Theorem...
Checks for a syllogism’s Cyan::validity do not apply Bayes’ Theorem per se. No prior and likelihood need be specified, and no posterior is calculated. The question is “can we start with Bayes’ Theorem as an equation, take whatever the premises assert about the variables in that equation (inequalities or whatever), and derive the conclusion?” Checks for SilasBarta::validity also don’t apply Bayes’ Theorem as far as I can tell—they just involve an extra element (a probability distribution for the variables of the Bayes’ Theorem equation) and an extra operation (expectation w.r.t. the previously mentioned distribution).
You would reject the construction of such a scenario (where your inference is way off) as a “counterexample” or somehow showing the invalidity of updates performed under the Bayes theorem.
This is definitely a point of miscommunication, because I certainly never intended to impeach Bayes’ Theorem.
Given that you don’t know what D and H are, and given the two premises, your best estimate of P(H|D) is low.
Maybe. I’ve still yet to be convinced that it’s possible to derive a meta-probability distribution for the unconditional probabilities.
Wrong:
The text you link uses Shannon’s definition of the entropy of a continuous distribution, not Jaynes’s.
But you’re permitting yourself the same thing! Whenever you apply the Bayes Theorem..
Checks for a syllogism’s Cyan::validity do not apply Bayes’ Theorem per se. …
Argh. I wasn’t saying that you were using the Bayes Theorem in your claimed definition of Cyan::validity. I was saying that when you are deriving probabilities through Bayesian inference, you are implicitly applying a standard of validity for probabilistic syllogisms—a standard that matches mine, and yields the conclusion I claimed about the syllogism in question.
This is definitely a point of miscommunication, because I certainly never intended to impeach Bayes’ Theorem.
Yes, definitely a miscommunication: my point there was that the existence of cases where Bayesian inference gives you a probability differing from the true distribution is not evidence for the Bayes Theorem being invalid. I don’t know how you read it before, but that was the point, and I hope it makes more sense now.
Given that you don’t know what D and H are, and given the two premises, your best estimate of P(H|D) is low.
Maybe. I’ve still yet to be convinced that it’s possible to derive a meta-probability distribution for the unconditional probabilities.
Why? Because you don’t see how defining the variables is a kind of information you’re not allowed to have here? Because you think you can update (have a non-unity P(D)/P(H) ratio) in the absence of any information about P(D) and P(H)? Because you don’t see how the “member of Congress” case is an example of a low entropy, concentrated-probability-mass case? Because you reject meta-probabilities to begin with (in which case it’s not clear what makes probabilities found through Bayesian inference more “right” or “preferable” to other probabilities, even as they can be wrong)?
The text you link uses Shannon’s definition of the entropy of a continuous distribution, not Jaynes’s.
So? The difference only matters if you want to know the absolute (i.e. scale-invariant) magnitude of the entropy. If you’re only concerned about which distribution has the maximum entropy, you don’t need to pick an invariant measure (at least not for a case as simple as this one), and Shannon and Jaynes give the same result.
when you are deriving probabilities through Bayesian inference, you are implicitly applying a standard of validity for probabilistic syllogisms… that matches mine
I do not agree that that is what I’m doing. I don’t know why my willingness to use Bayes’ Theorem commits me to SilasBarta::validity.
I hope it makes more sense now.
I think I understand what you meant now. I deny that I am permitting myself the same thing as you. I try to make my problems well-structured enough that I have grounds for using a given probability distribution. I remain unconvinced that probabilistic syllogisms not attached to any particular instance have enough structure to justify a probability distribution for their elements—too much is left unspecified. Jaynes makes a related point on page 10 of “The Well-Posed Problem” at the start of section 8.
Why [are you unconvinced]?
Because the only argument you’ve given for it is a maxent one, and it’s not sufficient to the task, as I explain further below.
If you’re only concerned about which distribution has the maximum entropy, you don’t need to pick an invariant measure (at least not for a case as simple as this one), and Shannon and Jaynes give the same result.
This is not correct. The problem is that Shannon’s definition is not invariant to a change of variable. Suppose I have a square whose area is between 1 cm^2 and 4 cm^2. The Shannon-maxent distribution for the square’s area is uniform between 1 cm^2 and 4 cm^2. But such a square has sides whose lengths are between 1 cm and 2 cm. For the “side length” variable, the Shannon-maxent distribution is uniform between 1 cm and 2 cm. Of course, the two Shannon-maxent distributions are mutually inconsistent. This problem doesn’t arise when using the Jaynes definition.
In your problem, suppose that, for whatever reason, I prefer the floodle scale to the probability scale, where floodle = prob + sin(2*pi*prob)/(2.1*pi). Why do I not get to apply a Shannon-maxent derivation on the floodle scale?
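A minimal sketch of the square example (my own code, with the numbers given above): the two “maximally ignorant” prescriptions genuinely disagree, e.g. about how likely the side is to be shorter than 1.5 cm. The floodle question is pointing at the same change-of-variable issue.

```python
import random

N = 100_000

# Shannon-maxent on the "area" variable: A ~ Uniform(1, 4) cm^2, then side = sqrt(A).
sides_from_area = [random.uniform(1.0, 4.0) ** 0.5 for _ in range(N)]

# Shannon-maxent on the "side length" variable: s ~ Uniform(1, 2) cm.
sides_direct = [random.uniform(1.0, 2.0) for _ in range(N)]

p_from_area = sum(s < 1.5 for s in sides_from_area) / N  # ~ (1.5**2 - 1) / 3 ~ 0.417
p_direct = sum(s < 1.5 for s in sides_direct) / N        # ~ 0.5
print(p_from_area, p_direct)  # the two distributions are mutually inconsistent
```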
I do not agree that that is what I’m doing. I don’t know why my willingness to use Bayes’ Theorem commits me to SilasBarta::validity.
Because you’re apparently giving the same status (“SilasBarta::validity”) to Bayesian inferences that I’m giving to the disputed syllogism S1. In what sense is it true that Bob is “probably” the murderer, given that you only know he’s been accused, and that his prints were then found on the murder weapon? Okay: in that sense I say that the conclusion of S1 is valid.
Where do you think I’m saying something different?
I deny that I am permitting myself the same thing as you. I try to make my problems well-structured enough that I have grounds for using a given probability distribution. I remain unconvinced that probabilistic syllogisms not attached to any particular instance have enough structure to justify a probability distribution for their elements—too much is left unspecified.
What about the Bayes Theorem itself, which does exactly that (specify a probability distribution on variables not attached to any particular instance)?
In your problem, suppose that, for whatever reason, I prefer the floodle scale to the probability scale, where floodle = prob + sin(2*pi*prob)/(2.1*pi). Why do I not get to apply a Shannon-maxent derivation on the floodle scale?
Because a) your information was given with the probability metric, not the floodle metric, and b) a change in variable can never be informative, while this one allows you to give yourself arbitrary information that you can’t have, by concentrating your probability on an arbitrary hypothesis.
The link I gave specified that the uniform distribution maximizes entropy even for the Jaynes definition.
Because you’re apparently giving the same status (“SilasBarta::validity”) to Bayesian inferences that I’m giving to the disputed syllogism S1.
For me, the necessity of using Bayesian inference follows from Cox’s Theorem, an argument which invokes no meta-probability distribution. Even if Bayesian inference turns out to have SilasBarta::validity, I would not justify it on those grounds.
What about the Bayes Theorem itself, which does exactly that (specify a probability distribution on variables not attached to any particular instance)?
I wouldn’t say that Bayes’ Theorem specifies a probability distribution on variables not attached to any particular instance; rather it uses consistency with classical logic to eliminate a degree of freedom in how other methods can specify otherwise arbitrary probability distributions. That is, once I’ve somehow picked a prior and a likelihood, Bayes’ Theorem shows how consistency with logic forces my posterior distribution to be proportional to the product of those two factors.
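A toy discrete illustration of that last point (my own numbers, chosen only for the sketch): once a prior and a likelihood are chosen, the posterior is fixed up to normalization.

```python
prior = {"H": 0.3, "not_H": 0.7}        # chosen prior (illustrative)
likelihood = {"H": 0.9, "not_H": 0.2}   # chosen P(data | hypothesis) (illustrative)

unnormalized = {h: prior[h] * likelihood[h] for h in prior}
z = sum(unnormalized.values())
posterior = {h: v / z for h, v in unnormalized.items()}
print(posterior)  # {'H': ~0.66, 'not_H': ~0.34}; proportional to prior * likelihood
```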
Because a) your information was given with the probability metric, not the floodle metric, and b) a change in variable can never be informative, while this one allows you to give yourself arbitrary information that you can’t have, by concentrating your probability on an arbitrary hypothesis.
I’m going to leave this be because it is predicated on what I believe to be a confusion about the significance of using Shannon entropy instead of Jaynes’s version.
The link I gave specified that the uniform distribution maximizes entropy even for the Jaynes definition.
We’re at the “is not! / is too!” stage in our dialogue, so absent something novel to the conversation, this will be my final reply on this point.
The link does not so specify: this old revision shows that the example refers specifically to the Shannon definition. I believe the more general Jaynes definition was added later in the usual Wikipedia mishmash fashion, without regard to the examples listed in the article.
In any event, at this point I can only direct you to the literature I regard as definitive: section 12.3 of PT:LOS (pp 374-8) (ETA: Added link—Google Books is my friend). (The math in the Wikipedia article Principle of maximum entropy follows Jaynes’s material closely. I ought to know: I wrote the bulk of it years ago.) Here’s some relevant text from that section:
The conclusions, evidently, will depend on which [invariant] measure we adopt. This is the shortcoming from which the maximum entropy principle has suffered until now, and which must be cleared up before we can regard it as a full solution to the prior probability problem.
Let us note the intuitive meaning of this measure. Consider the one-dimensional case, and suppose it is known that a < x < b but we have no other prior information. Then… [e]xcept for a constant factor, the measure m(x) is also the prior distribution describing ‘complete ignorance’ of x. The ambiguity is, therefore, just the ancient one which has always plagued Bayesian statistics: how do we find the prior representing ‘complete ignorance’? Once this problem is solved [emphasis added], the maximum entropy principle will lead to a definite, parameter-independent method of setting up prior distributions based on any testable prior information.
Y… you mean you were citing as evidence a Wikipedia article you had heavily edited? Bad Cyan! ;-)
Okay, I agree we’re at a standstill. I look forward to comments you may have after I finish the article I mentioned. FWIW, the article isn’t about this specific point I’ve been defending, but rather, about the Bayesian interpretation of standard fallacy lists, where my position here falls out as a (debatable) implication.
One obstacle to understanding in this conversation seems to be that it involves the notion of “second-order probability”. That is, a probability is given to the proposition that some other proposition has a certain probability (or a probability within certain bounds).
As far as I know, this doesn’t make sense when only one epistemic agent is involved. An ideal Bayesian wouldn’t compute probabilities of the form p(x1 < p(A) < x2) for any proposition A.
Of course, if two agents are involved, then one can speak of “second-order probabilities”. One agent can assign a certain probability that the other agent assigns some probability. That is, if I use probability-function p, and you use probability function p*, then I might very well want to compute p(x1 < p*(A) < x2).
And the “two agents” here might be oneself at two different times, or one’s conscious self and one’s unconscious intuitive probability-assigning cognitive machinery.
From where I’m sitting, it looks like SilasBarta just needs to be clear that he’s using the coherent notion of “second-order probability”. Then the disagreement dissolves.
One obstacle to understanding in this conversation seems to be that it involves the notion of “second-order probability”.
Naw, that part’s cool. (I already had the idea of a meta-probability in my armamentarium.) The major obstacle to understanding was that we meant different things by the word “valid”.
As far as I know, this doesn’t make sense when only one epistemic agent is involved.
If you think there’s a fact of the matter about what p(A) is (or should be) then it makes sense. You can reason as follows: “There are some situations where I should assign an 80% probability to a. What is the probability that A is such an a?”
Unless you think “What probability should I assign to A” is entirely a different sort of question than simply “What is p(A)”.
If you think there’s a fact of the matter about what p(A) is (or should be) then it makes sense. You can reason as follows: “There are some situations where I should assign an 80% probability to a. What is the probability that A is such an a?”
I have plenty to learn about Bayesian agents, so I may be wrong. But I think that this would be a mixing of the object-language and the meta-language.
I’m supposing that a Bayesian agent evaluates probabilities p(A) where A is a sentence in a first-order logic L. So how would the agent evaluate the probability that it itself assigns a certain probability to some sentence?
We can certainly suppose that the agent’s domain of discourse D includes the numbers in the interval (0, 1) and the functions mapping sentences in L to the interval (0, 1). For each such function f let ‘f’ be a function-symbol for which f is the interpretation assigned by the agent. Similarly, for each number x in (0, 1), let ‘x’ be a constant-symbol for which x is the interpretation.
Now, how do we get the agent to evaluate the probability that p(A) = x? The natural thing to try might be to have the agent evaluate p(‘p’(A) = ‘x’). But the problem is that ‘p’(A) = ‘x’ is not a well-formed formula in L. Writing a sentence as the argument following a function symbol is not one of the valid ways to construct well-formed formulas.
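A rough way to see the grammar point, sketched as a toy abstract syntax (my own construction, not anything from the comment above): function symbols take terms as arguments, so there is no slot into which a formula like A can go.

```python
from dataclasses import dataclass
from typing import Tuple, Union

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class FuncApp:
    symbol: str
    args: Tuple["Term", ...]  # a function symbol applies to terms only

Term = Union[Var, FuncApp]    # the syntactic category of terms

@dataclass(frozen=True)
class Atom:
    predicate: str
    args: Tuple[Term, ...]    # an atomic formula: a predicate applied to terms

def check_term(t: object) -> None:
    """Raise unless t is a well-formed term; formulas are rejected."""
    if isinstance(t, (Var, FuncApp)):
        if isinstance(t, FuncApp):
            for arg in t.args:
                check_term(arg)
        return
    raise TypeError(f"{t!r} is not a term, so it cannot be a function argument")

A = Atom("Raining", (Var("x"),))  # a formula of L
bad = FuncApp("p", (A,))          # the attempted 'p'(A)
try:
    check_term(bad)
except TypeError as e:
    print(e)  # rejected: a formula cannot appear where a term is required
```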
Wouldn’t I say that to be for the best, given that I started the thread by linking to the paper?
That’s no excuse for not providing a meaningful summary so that others can gauge whether it’s worth their time. You need to give more than “Vladimir says so” as a reason for judging the paper worthwhile.
You … do … understand the paper well enough to provide such a summary … RIGHT?
I was linking not just to the paper, but to a summary of the paper, and included that example from that summary (a summary of a summary). Others have already summarized what you got wrong in your reply. You can see that the paper has about 1300 citations, which should attest to its importance.