So I agree with most of what you say here, and as a Metaculus user I have some sympathy for trying to make proper scoring rules the epistemological basis of “probability-speak”. There are some problems with it, like different proper scoring rules give different incentives to people when it comes to distributing finite resources across many questions to acquire info about them, but broadly I think the norm of scoring models (or even individual forecasters) by their Brier score or log score and trying to maximize your own score is a good norm.
There are probably other issues, but the immediate problem for me is that this way of bootstrapping probabilistic theories seems to be circular. Given that you accept the whole Bayesian framework already, it’s obvious that under this meta-normative theory you’re supposed to report your true credence for any event because that’s what will maximize your expected log score. This is perfectly consistent but the proper scoring rule appears to be superfluous if you already are a Bayesian. However, if you don’t already accept the Bayesian way of looking at the problem then “maximize S(A)=logP(A)” is useless advice: S is a function from the states of the world to the real numbers and there’s no total order on that space for you to use for this maximization problem. In practice we would act like Bayesians and this would work, but then we’re right back where we started because we’re using probabilities when they don’t seem to add any epistemic content.
There are other versions of this which I’ve mentioned in other comments: for example you can have a norm of “try to make money by betting on stuff” and you can use a Dutch book argument to show that contingent claim prices are going to give you a probability measure. While that justifies the use of some probabilities with a fairly natural sounding norm, it doesn’t explain what I’m doing when I price these contingent claims or what the funny numbers I get as a result of this process actually mean. (It also leads to some paradoxes when the contingent claim payoffs are correlated with your marginal utility, but I’m setting that issue aside here.)
My central point of disagreement with your answer is that I don’t think “claims must be either True or False” is a meta-normative intuition and I think it can’t be necessary to abandon the law of excluded middle in order to justify the use of probabilities. In fact, even the proper scoring rule approach you outline doesn’t really throw out the law of excluded middle, because unless there’s some point at which the question will resolve as either True or False there’s no reason for you to report your “true credence” to maximize your expected score and so the whole edifice falls apart.
the immediate problem for me is that this way of bootstrapping probabilistic theories seems to be circular.
I think it is not circular, though I can imagine why it seems so. Let me try to elaborate the order of operations as I see it.
Syntax: Accept that a probability-sentence like “P(there will be a sea-battle tomorrow) ≥ 0.4” is at least syntactically parseable, i.e. not gibberish, even if it is semantically disqualified from being true (like “the present King of France is a human”).
This can be formalized as adding a new term-former P:ClassicalSentence→ProbabilityTerm, other term-formers such as +:ProbabilityTerm×ProbabilityTerm→ProbabilityTerm, constants C:Q→ProbabilityTerm, and finally a predicate ≥0:ProbabilityTerm→ProbabilitySentence.
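To make those type signatures concrete, here is a minimal sketch of the two-layer grammar as Haskell data types. The type names follow the term-formers listed above; the Mul and Sub constructors are my own additions so that terms like 0.5⋅P(A) and P(A)−q are expressible.

```haskell
import Data.Ratio ((%))

-- Classical (Boolean) sentences underneath.
data ClassicalSentence
  = Atom String
  | Top
  | Bot
  | Not ClassicalSentence
  | And ClassicalSentence ClassicalSentence
  | Or  ClassicalSentence ClassicalSentence
  deriving (Eq, Show)

-- Probability terms on top: the new term-former P(·), rational constants,
-- and arithmetic term-formers.
data ProbabilityTerm
  = P ClassicalSentence                  -- P : ClassicalSentence -> ProbabilityTerm
  | Const Rational                       -- C : Q -> ProbabilityTerm
  | Add ProbabilityTerm ProbabilityTerm  -- +
  | Mul ProbabilityTerm ProbabilityTerm  -- · (my addition, for 0.5·P(A))
  | Sub ProbabilityTerm ProbabilityTerm  -- − (my addition, for P(A) − q)
  deriving (Eq, Show)

-- The single predicate ≥0 turns a probability term into a probability sentence.
newtype ProbabilitySentence = GeqZero ProbabilityTerm
  deriving (Eq, Show)

-- “P(there will be a sea-battle tomorrow) ≥ 0.4”, encoded as P(A) − 2/5 ≥ 0.
seaBattle :: ProbabilitySentence
seaBattle = GeqZero (Sub (P (Atom "sea-battle tomorrow")) (Const (2 % 5)))
```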
Logic: Accept that probability-sentences can be the premises and/or conclusions of valid deductions, such as P(A)≥0.4,P(B∧A)≥0.5⋅P(A)⊢P(B)≥0.2.
Axiomatizing the valid deductions in a sound and complete way is not as easy as it may seem, because of the interaction with various expressive features one might want (native conditional probabilities, higher-order probabilities, polynomial inequalities) and model-theoretic and complexity-theoretic issues (pathological models, undecidable satisfiability). Some contenders:
LPWF, which has polynomial inequalities but not higher-order probabilities
LCP, which has higher-order conditional probabilities but not inequalities
LPP2, which has neither, but has decidable satisfiability.
Anyway, the basic axioms about probability that we need for such logics are:
P(α)≥0
P(⊤)=1
P(⊥)=0
P(α)+P(β)=P(α∨β)+P(α∧β)
P(α↔β)=1→P(α)=P(β)
α⊢P(α)=1
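As a sanity check that these axioms do the work, the example deduction from the Logic step goes through as follows. B ↔ (B∧A)∨(B∧¬A) and ((B∧A)∧(B∧¬A)) ↔ ⊥ are tautologies, so the last two axioms let us substitute them, and additivity then gives P(B∧A)+P(B∧¬A) = P((B∧A)∨(B∧¬A)) + P((B∧A)∧(B∧¬A)) = P(B) + P(⊥) = P(B). Since P(B∧¬A) ≥ 0, this yields P(B) ≥ P(B∧A) ≥ 0.5⋅P(A) ≥ 0.5⋅0.4 = 0.2.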
Those axioms can, if you wish, be derived from much weaker principles by Cox-style theorems. It’s important to admit that Cox’s proof of his original theorem (as cited by Jaynes) was mistaken, so there isn’t actually a single “Cox’s theorem”, but rather a family of variants that work given different assumed principles. My favorite is Van Horn 2017, which uses only the following principles:
Equivalence-invariance: If X↔Y and X→(A↔B), then c(A|X)=c(B|Y).
Definition-invariance: If s is an atomic proposition not appearing in A, X, or E, then c(A|X)=c(A|X∧(s↔E)).
Irrelevance-invariance: If Y is a noncontradictory formula sharing no symbols with either X or A, then c(A|X)=c(A|X∧Y).
Implication-compatibility: If X→(A→B) but not X→(B→A), then c(A|X)<c(B|X).
Epistemics: Revise the Aristotelian norms, as follows:
Instead of demanding that a rational speaker either assert or deny any classical sentence about relevant propositions, demand that
a rational speaker assert or deny any probability-sentence about relevant propositions, and that
all their assertions be coherent, in the sense that probability-logic cannot deduce ⊥ from any subset of them.
Instead of classifying a speaker as either correct or incorrect (depending on whether they assert what is and deny what is not, or deny what is and assert what is not), score them on the basis of the greatest rational q for which they asserted P(A)−q≥0 (where A is the conjunction of all of “what is”, or rather of all that is observed), and award them log q points.
The log q rule in particular can be justified and characterized at this stage just by the property of invariance under observation orderings, i.e. logP(A₀)+logP(A₁|A₀)=logP(A₁) for nested observations A₁⊆A₀ (discussed more below).
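For concreteness, here is a toy sketch of the scoring step (all names hypothetical, and only simple lower-bound assertions of the form P(A)−q≥0 are handled): take the greatest q the speaker asserted for the observed A and award log q.

```haskell
-- Toy scoring sketch: an assertion (sentence, q) stands for “P(sentence) − q ≥ 0”;
-- the observed outcome A is named by a string.
type Assertion = (String, Rational)

-- Award log q for the greatest asserted lower bound q on the observed A,
-- or nothing if no bound on A was asserted.
scoreAgainst :: String -> [Assertion] -> Maybe Double
scoreAgainst observed assertions =
  case [q | (a, q) <- assertions, a == observed] of
    [] -> Nothing
    qs -> Just (log (fromRational (maximum qs)))

-- Example: having asserted P(A) ≥ 0.25 and P(A) ≥ 0.4, observing A
-- yields log 0.4 ≈ −0.916 epistemic points.
-- >>> scoreAgainst "A" [("A", 1/4), ("A", 2/5), ("B", 9/10)]
```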
Decision theory: Optionally, you can now assume the vNM axioms on top of the probabilistic logic, prove the vNM theorem, formalize a speech-act game internalizing the logP(A) rule, and then prove a revelation theorem that says that the optimal policy for obtaining epistemic points is to report one’s actual internal beliefs.
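For the binary special case, a minimal numeric sketch of that revelation property (nothing like the full vNM construction, and the names here are my own): if your actual credence in the observable is p and you report q, your expected log score is p⋅log q + (1−p)⋅log(1−q), and a brute-force scan over reports shows it peaks at q = p.

```haskell
-- Expected log score for reporting q when your actual credence is p,
-- assuming you also assert the complementary bound P(¬A) ≥ 1 − q.
expectedLogScore :: Double -> Double -> Double
expectedLogScore p q = p * log q + (1 - p) * log (1 - q)

-- Brute-force search over a grid of possible reports.
bestReport :: Double -> Double
bestReport p = snd (maximum [(expectedLogScore p q, q) | q <- grid])
  where grid = [i / 100 | i <- [1 .. 99]]

-- >>> bestReport 0.37
-- 0.37  (honest reporting wins, up to the grid resolution)
```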
I think the key confusion here is that it may seem like one needs the decision theory already set up in order to justify the scoring rule (to establish that it incentivizes honest revelation), even though the decision theory itself depends on the scoring rule, which would indeed be circular. I claim that the scoring rule can be justified on grounds other than honest revelation. If you don’t buy the argument from invariance under observation orderings, I can probably come up with other justifications, e.g. from coding theory. Closing the decision-theoretic loop also provides some justificatory force, even if it is circular, since being able to set up a revelation theorem is certainly a nice feature of this logP(A) norm.
But fundamentally, whether in this system or Aristotle’s, one doesn’t identify the epistemic norms by trying to incentivize honest reporting of beliefs, but rather by trying to validate reports that align with reality. The logP(A) rule stands as a way of extending the desire for reports that align with reality to the non-Boolean logic of probability, so that we can talk rationally about sea-battles and other uncertain events, without having to think about in what order we find things out.
different proper scoring rules give different incentives to people when it comes to distributing finite resources across many questions to acquire info about them
I haven’t studied this difference, but I want to register my initial intuition that to the extent other proper scoring rules give value-of-information incentives different from those of the log scoring rule, the log rule’s incentives are the better ones. In particular, I expect the incentives of the log rule to be more invariant to the different ways of asking multiple questions that add up to one composite problem domain, and being sensitive to that choice would be a misfeature.
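To illustrate that intuition with made-up numbers: ask “A?” and then “B given A?”, versus asking “A∧B?” directly. With the log rule the two ways of carving up the domain earn the same total; with a squared-error (Brier-style) penalty they do not. A small sketch, names my own:

```haskell
logScore :: Double -> Double
logScore q = log q  -- score when the event you gave probability q occurs

-- Suppose A and then B (given A) both occur, with reports pA and pBgivenA.
chained, joint :: Double -> Double -> Double
chained pA pBgivenA = logScore pA + logScore pBgivenA  -- two separate questions
joint   pA pBgivenA = logScore (pA * pBgivenA)         -- one composite question
-- chained 0.8 0.5 and joint 0.8 0.5 both come to log 0.4 ≈ −0.916.

brierPenalty :: Double -> Double
brierPenalty q = (1 - q) ^ (2 :: Int)  -- penalty when the event occurs
-- brierPenalty 0.8 + brierPenalty 0.5 = 0.29, but brierPenalty 0.4 = 0.36:
-- the squared-error total depends on how the domain was split into questions.
```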
In fact, even the proper scoring rule approach you outline doesn’t really throw out the law of excluded middle, because unless there’s some point at which the question will resolve as either True or False there’s no reason for you to report your “true credence” to maximize your expected score and so the whole edifice falls apart.
Even if a question never resolves fully enough to make every observable either True or False (i.e., never resolves all the way to a Dirac delta, assuming a Hausdorff possibility space), but instead resolves incrementally to more and more precise observations A₀⊃A₁⊃⋯⊃Aₖ⊃⋯, the log scoring rule remains proper, since
logP(Aₖ) + logP(Aₖ₊₁|Aₖ) = logP(Aₖ) + log[P(Aₖ₊₁∩Aₖ)/P(Aₖ)]
= logP(Aₖ) + log[P(Aₖ₊₁)/P(Aₖ)]
= logP(Aₖ) + logP(Aₖ₊₁) − logP(Aₖ)
= logP(Aₖ₊₁).
I don’t think the same can be said for the Brier scoring rule; it doesn’t even seem to have a well-defined generalization to this case.
There are a couple fiddly assumptions here I should bring out explicitly:
when it comes to epistemic value, we should have a temporal discount factor of γ=1, very much unlike prudential or ethical values where I argue the discount factor must be γ<1.
If we don’t do this, then we get an incentive to smear out our forecasts to the extent we expect high precision to take a long time to obtain.
This is one reason to keep epistemic value as a separate normative domain from other kinds of value.
The point you mentioned parenthetically about contingencies correlating with marginal utility is another reason to keep utility separate from epistemic value.
When we decide what probabilistic statements to make, we should act as if either the question will eventually resolve fully, or else “there will always be more to discover” and that “more” always does get discovered eventually.
Big tangent: There is a resonance here with CEV, where we try to imagine an infinite future limit of all ethical knowledge having been learned, and judge our current intentions by that standard, without discounting it for being far in the future, or discounting the whole scenario for being less-than-certain that ethical beings will survive and continue their ethical development indefinitely or until there is nothing more to learn.
Here we are sort-of in the domain of ethics, where I’d say temporal discounting is necessary, but methodologically the question of how to determine ethical value is an epistemic one. So we shouldn’t discount future ethical-knowledge Bayes-points, but we can still discount object-level ethical value.