My reason for writing this is not to correct Eliezer. Rather, I want to expand on his distinction between prior information and prior probability. Pages 87-89 of Probability Theory: the Logic of Science by E. T. Jaynes (2004 reprint with corrections, ISBN 0 521 59271 2) are dense with important definitions and principles. The quotes below are from there, unless otherwise indicated.
Jaynes writes the fundamental law of inference as
P(H|DX) = P(H|X) P(D|HX) / P(D|X) (4.3)
Which the reader may be more used to seeing as
P(H|D) = P(H) P(D|H) / P(D)
Where
H = some hypothesis to be tested
D = the data under immediate consideration
X = all other information known
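To make the notation concrete, here is a minimal sketch in Python of equation (4.3). It is my own illustration, not Jaynes's: the priors P(Hi|X) and likelihoods P(D|HiX) are made-up numbers, and P(D|X) is expanded by the law of total probability over an exhaustive set of hypotheses.

```python
# A sketch of equation (4.3): P(H|DX) = P(H|X) P(D|HX) / P(D|X).
# The numbers below are illustrative assumptions; in a real problem
# they would be derived from a carefully specified X.

priors = {"H1": 0.7, "H2": 0.3}        # P(Hi|X), an exhaustive set
likelihoods = {"H1": 0.2, "H2": 0.9}   # P(D|HiX)

# P(D|X) by the law of total probability over the hypotheses
p_d = sum(priors[h] * likelihoods[h] for h in priors)

posteriors = {h: priors[h] * likelihoods[h] / p_d for h in priors}
print(posteriors)  # {'H1': 0.341..., 'H2': 0.658...}
```

Note that X never appears as a variable in the code: it is implicit in every probability we assigned, which is exactly Jaynes's point that no probability is unconditional.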
X is the misleadingly-named ‘prior information’, which represents all the information available other than the specific data D that we are considering at the moment. “This includes, at the very least, all its past experiences, from the time it left the factory to the time it received its current problem.”—Jaynes p. 87, referring to a hypothetical problem-solving robot. It seems to me that in practice, X ends up being a representation of a subset of all prior experience, one that attempts to discard only what is irrelevant to the problem. In real human practice, that representation may be wrong and may need to be corrected.
“ … to our robot, there is no such thing as an ‘absolute’ probability; all probabilities are necessarily conditional on X at the least.” “Any probability P(A|X) which is conditional on X alone is called a prior probability. But we caution that ‘prior’ … does not necessarily mean ‘earlier in time’ … the distinction is purely a logical one; any information beyond the immediate data D of the current problem is by definition ‘prior information’.”
“Indeed, the separation of the totality of the evidence into two components called ‘data’ and ‘prior information’ is an arbitrary choice made by us, only for our convenience in organizing a chain of inferences.” Please note his use of the word ‘evidence’.
Sampling theory, which is the basis of many treatments of probability, “ … did not need to take any particular note of the prior information X, because all probabilities were conditional on H, and so we could suppose implicitly that the general verbal prior information defining the problem was included in H. This is the habit of notation that we have slipped into, which has obscured the unified nature of all inference.”
“From the start, it has seemed clear how one determines numerical values of sampling probabilities¹ [e.g. P(D|H) ], but not what determines prior probabilities [AKA ‘priors’, e.g. P(H|X)]. In the present work we shall see that this is only an artifact of the unsymmetrical way of formulating problems, which left them ill-posed. One could see clearly how to assign sampling probabilities because the hypothesis H was stated very specifically; had the prior information X been specified equally well, it would have been equally clear how to assign prior probabilities.”
Jaynes never gives up on that X notation (though the letter may differ); he never drops it for convenience.
“When we look at these problems on a sufficiently fundamental level and realize how careful one must be to specify prior information before we have a well-posed problem, it becomes clear that … exactly the same principles are needed to assign either sampling probabilities or prior probabilities …” That is, P(H|X) should be calculated. Keep your copy of Kendall and Stuart handy.
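What does calculating P(H|X) look like? Jaynes's tools for this include the principle of indifference and, later in the book, maximum entropy. As a hedged sketch: the dice problem below is his standard Brandeis example (faces 1-6, X specifies an average roll of 4.5), but the code is my own illustration, not from the book.

```python
import math

# Maximum-entropy assignment of P(H|X) for the Brandeis dice problem:
# X says the die has faces 1..6 and the mean roll is 4.5. The maxent
# solution is p_k proportional to exp(lam * k), with the Lagrange
# multiplier lam chosen to match the mean; we find it by bisection.

faces = range(1, 7)
target_mean = 4.5

def mean_for(lam):
    weights = [math.exp(lam * k) for k in faces]
    z = sum(weights)
    return sum(k * w for k, w in zip(faces, weights)) / z

lo, hi = -5.0, 5.0           # bracket for lam; mean_for is increasing
for _ in range(100):         # bisection to convergence
    mid = (lo + hi) / 2
    if mean_for(mid) < target_mean:
        lo = mid
    else:
        hi = mid

lam = (lo + hi) / 2
weights = [math.exp(lam * k) for k in faces]
z = sum(weights)
print([round(w / z, 4) for w in weights])
# ~ [0.0543, 0.0788, 0.1142, 0.1654, 0.2398, 0.3475]
```

The point is that once X is stated this precisely, the prior is forced by calculation, not chosen by taste.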
I think priors should not be cheaply set from an opinion, whim, or wish. “ … it would be a big mistake to think of X as standing for some hidden major premise, or some universally valid proposition about Nature.”
The prior information has impact beyond setting prior probabilities (priors). It informs the formulation of the hypotheses, of the model, and of “alternative hypotheses” that come to mind when the data seem to be showing something really strange. For example, data that seem to strongly support psychokinesis may lead a skeptic to propose a hypothesis of fraud, whereas a career psychic researcher may not. (See Jaynes pp. 122-125.)
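A toy calculation, with numbers I made up purely for illustration, shows the mechanism Jaynes describes there: once the skeptic's X puts a fraud hypothesis on the table with a prior far above that of psychokinesis, data that fit both hypotheses equally well mostly confirm fraud.

```python
# Illustrative numbers only (not Jaynes's): three exhaustive hypotheses
# for a striking parapsychology result D.
priors = {"chance": 0.989999, "fraud": 0.01, "psi": 1e-6}   # P(H|X)
likelihoods = {"chance": 1e-6, "fraud": 0.5, "psi": 0.5}    # P(D|HX)

p_d = sum(priors[h] * likelihoods[h] for h in priors)       # P(D|X)
posteriors = {h: priors[h] * likelihoods[h] / p_d for h in priors}
for h, p in posteriors.items():
    print(h, round(p, 6))
# fraud ends up near 0.9997; psi stays near 1e-4 despite the data
```

The researcher whose X contains no fraud hypothesis never computes that third row, and so reads the same D as overwhelming support for psi.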
I say, be alert for misinformation, biases, and wishful thinking in your X. Discard everything that is not evidence.
I’m pretty sure the free version of Probability Theory: The Logic of Science is offline. You can preview the book here: http://books.google.com/books?id=tTN4HuUNXjgC&printsec=frontcover&dq=Probability+Theory:+The+Logic+of+Science&cd=1#v=onepage&q&f=false
Also see the Unofficial Errata and Commentary for E. T. Jaynes’s Probability Theory: The Logic of Science
SEE ALSO
Priors
Probability is Subjectively Objective
FOOTNOTES
¹ There are massive compendiums of methods for sampling distributions, such as Feller (An Introduction to Probability Theory and its Applications, Vol. 1, J. Wiley & Sons, New York, 3rd edn 1968, and Vol. 2, J. Wiley & Sons, New York, 2nd edn 1971) and Kendall and Stuart (The Advanced Theory of Statistics: Volume 1, Distribution Theory, Macmillan, New York, 1977). Be familiar with what is in them.
Edited 05/05/2010 to put in the actual references.
Edited 05/19/2010 to put in SEE ALSO