So what does this say about cases where the correct answer is implied by the axioms but you don’t have enough computing power to find it? From what I can tell, this picks out the class of successful proofs a:A and gives them probability 1. But I’m not really sure.
Well, the actual idea is that your axioms might not have probability 1.0 (that is, you might have a distribution over axiom sets, since the actual definition of your logic is given as natural-deduction rules instead of in Hilbert style), and also that you could do partial proofs. A partial proof would contain an unbound proof variable (an unproven hypothesis), whose probability would go according to its Solomonoff measure.
And as it turns out, Harper’s Practical Foundations for Programming Languages gives a much better coverage of the syntax and semantics of a natural-deduction system for classical logic, so now I’ve got a good deal more to work with for probabilizing things.
If I’m understanding this right, then although this kind of approach is useful in many ways (for a local example, it’s basically used in MIRI’s interesting paper on self-trust), it doesn’t quite capture what I want. For example, I am extremely uncertain what the 31415926th digit of pi is, even if we take arithmetic and the definition of pi as given.
This is basically the difference between assigning a sensible probability measure to logical sentences, and defining an efficiently-computable yet still useful “logical probability” function.
For example, I am extremely uncertain what the 31415926th digit of pi is, even if we take arithmetic and the definition of pi as given.
This sentence is ambiguous, but let me see if I can interpret what you’re talking about.
The Solomonoff Measure goes on the judgement, a : T, where a is a proof-object/construction and T is a proposition/type (it’s actually a bit more complicated than this for classical logic, but bear with me); it measures “how much space do constructions of type T occupy in the space of all possible constructions our deduction rules can generate?” (this may not be precisely the Solomonoff distribution, and constructing it is probably a piece of research in and of itself, but it’s still based on Kolmogorov’s core idea for measures over countably infinite sets). However, if we condition on the typing judgement, we can then actually put a completely typical maximum-entropy (in this case: uniform) prior over the different introduction rules (usually labeled “data constructors” in ordinary programming) which could have yielded our unknown proof-object of the appropriate type.
This would give you a uniform prior over digits when asking about the nth digit of \pi. The “unknown” proof object, the unbound variable or free-standing assumption, can then be used as a prior to perform Bayesian updating based on “empirical” evidence, or the prior can be used to make decisions. Or, when we have the computing power available, we can actually use equational reasoning to replace the unknown with a more refined (ie: computed-forward) proof object.
Except that actually means we’re talking about something like a relationship between head-normal-form proof objects and non-normalized proof objects, with free-standing assumptions (unbound variables) being used only for things like empirical hypotheses, as opposed to computations we just haven’t managed to do.
Unfortunately, all of that stuff comes after you’ve already got the fundamental logic nailed down and can start adding things like inductive types capable of describing such arbitrary data as the natural numbers.
EDIT: In light of “that stuff” (probabilities over digits of \pi) being largely separate from a full way of establishing probability values for arbitrary logical formulas, I will probably go look up Kolmogorov’s method for constructing measures over countably infinite sets and write an entire article on using Kolmogorov measures over arbitrary inductive types and equational reasoning to build distributions over non-normalized computational terms.
This would give you a uniform prior over digits when asking about the nth digit of \pi.
This is very likely a non sequitur. But yes, I agree that in this formalism it’s tough to express the probability of a sentence.
free-standing assumptions (unbound variables) being used only for things like empirical hypotheses, as opposed to computations we just haven’t managed to do.
If this mean what I think it means, then yes, this was my interpretation too—plus assuming the truth or falsity of propositions that are undecided by the axioms.
In light of “that stuff” (probabilities over digits of \pi) being largely separate from a full way of establishing probability values for arbitrary logical formulas
I would like to reiterate that there is a difference between assigning probabilities and the project of logical probability. Probabilities have to follow the product rule, which is contains modus ponens in the case of certainty. If you assign probability 1 to the axioms, there is only one correct distribution for anything that follows from the axioms, like a digit of pi (or as the kids are calling it these days, $\pi$): probability 1 for the right answer, 0 for the wrong answers. If you want logical probabilities to deviate from this (like by being ignorant about some digit of pi), then logical probabilities cannot follow the same rules as probabilities.
I will probably go look up Kolmogorov’s method for constructing measures over countably infinite sets and write an entire article on using Kolmogorov measures over arbitrary inductive types and equational reasoning to build distributions over non-normalized computational terms.
Cool. If you use small words, then I will be happy :)
If this mean what I think it means, then yes, this was my interpretation too—plus assuming the truth or falsity of propositions that are undecided by the axioms.
Or assigning degrees of belief to the axioms themselves! Any mutually-consistent set of statements will allow some probability assignment in which each statement has a nonzero degree of belief.
If you assign probability 1 to the axioms, there is only one correct distribution for anything that follows from the axioms, like a digit of pi (or as the kids are calling it these days, $\pi$): probability 1 for the right answer, 0 for the wrong answers. If you want logical probabilities to deviate from this (like by being ignorant about some digit of pi), then logical probabilities cannot follow the same rules as probabilities.
Which is actually an issue I did bring up, in proof-theoretic language. Let’s start by setting the term “certain distribution” to mean “1.0 to the right answer, 0.0 to everything else”, and then set “uncertain distribution” to be “all distributions wider than this”.
Except that actually means we’re talking about something like a relationship between head-normal-form proof objects and non-normalized proof objects,
A non-normalized proof object (read: the result of a computation we haven’t done yet, a lazily evaluated piece of data) has an uncertain distribution over values. To head-normalize a proof object means to evaluate enough of it to decide (compute the identity of, with certainty) the outermost introduction rule (outermost data constructor), but (by default) this still leaves us uncertain of the premises which were given to that introduction rule (the parameters passed to the data constructor). As we proceed down the levels of the tree of deduction rules, head-normalizing as we go, we eventually arrive to a completely normalized proof-term/piece of data, for which no computation remains undone.
Only a completely normalized proof term has a certain distribution: 1.0 for the right answer, 0.0 for everything else. In all other cases, we have an uncertain distribution, albeit one in which all the probability mass may be allocated in the particular region associated with only one outermost introduction rule (the special case of head-normalized terms).
Remember, these distributions are being assigned over proof terms (ie: constructions in some kind of lambda calculus), not over syntactic values (well-typed and fully normalized terms). The distributions express our state of knowledge given the limited certainty granted by however many levels of the tree (all terms of inductive types can be written as trees) are already normalized to data constructors—they express uncertain states of knowledge in situations where the uncertainty derives from having a fraction of the computational power necessary to decide a decidable question.
So when we’re talking about “digits of $\pi$” sorts of problems, we should actually speak of distributions over computational terms, not “logical probabilities”. This problem has nothing to do with assigning probabilities to sentences in first-order classical logic: by the time we construct and reason with our distributions over computational terms, we have already established the existence of a proof object inhabiting a particular type. If that type happened to be a mere proposition, then we’ve already established, by the time we construct this distribution, that we believe in the truth of the proposition, and are merely reasoning under discrete uncertainty over how its truth was proof-theoretically asserted.
Whereas I think that establishing probabilities for sentences which are not decided by our given axioms may be more difficult, particularly when those sentences may be decidable by other means.
So what does this say about cases where the correct answer is implied by the axioms but you don’t have enough computing power to find it? From what I can tell, this picks out the class of successful proofs a:A and gives them probability 1. But I’m not really sure.
Well, the actual idea is that your axioms might not have probability 1.0 (that is, you might have a distribution over axiom sets, since the actual definition of your logic is given as natural-deduction rules instead of in Hilbert style), and also that you could do partial proofs. A partial proof would contain an unbound proof variable (an unproven hypothesis), whose probability would go according to its Solomonoff measure.
And as it turns out, Harper’s Practical Foundations for Programming Languages gives a much better coverage of the syntax and semantics of a natural-deduction system for classical logic, so now I’ve got a good deal more to work with for probabilizing things.
If I’m understanding this right, then although this kind of approach is useful in many ways (for a local example, it’s basically used in MIRI’s interesting paper on self-trust), it doesn’t quite capture what I want. For example, I am extremely uncertain what the 31415926th digit of pi is, even if we take arithmetic and the definition of pi as given.
This is basically the difference between assigning a sensible probability measure to logical sentences, and defining an efficiently-computable yet still useful “logical probability” function.
This sentence is ambiguous, but let me see if I can interpret what you’re talking about.
The Solomonoff Measure goes on the judgement,
a : T
, wherea
is a proof-object/construction andT
is a proposition/type (it’s actually a bit more complicated than this for classical logic, but bear with me); it measures “how much space do constructions of type T occupy in the space of all possible constructions our deduction rules can generate?” (this may not be precisely the Solomonoff distribution, and constructing it is probably a piece of research in and of itself, but it’s still based on Kolmogorov’s core idea for measures over countably infinite sets). However, if we condition on the typing judgement, we can then actually put a completely typical maximum-entropy (in this case: uniform) prior over the different introduction rules (usually labeled “data constructors” in ordinary programming) which could have yielded our unknown proof-object of the appropriate type.This would give you a uniform prior over digits when asking about the nth digit of \pi. The “unknown” proof object, the unbound variable or free-standing assumption, can then be used as a prior to perform Bayesian updating based on “empirical” evidence, or the prior can be used to make decisions. Or, when we have the computing power available, we can actually use equational reasoning to replace the unknown with a more refined (ie: computed-forward) proof object.
Except that actually means we’re talking about something like a relationship between head-normal-form proof objects and non-normalized proof objects, with free-standing assumptions (unbound variables) being used only for things like empirical hypotheses, as opposed to computations we just haven’t managed to do.
Unfortunately, all of that stuff comes after you’ve already got the fundamental logic nailed down and can start adding things like inductive types capable of describing such arbitrary data as the natural numbers.
EDIT: In light of “that stuff” (probabilities over digits of \pi) being largely separate from a full way of establishing probability values for arbitrary logical formulas, I will probably go look up Kolmogorov’s method for constructing measures over countably infinite sets and write an entire article on using Kolmogorov measures over arbitrary inductive types and equational reasoning to build distributions over non-normalized computational terms.
This is very likely a non sequitur. But yes, I agree that in this formalism it’s tough to express the probability of a sentence.
If this mean what I think it means, then yes, this was my interpretation too—plus assuming the truth or falsity of propositions that are undecided by the axioms.
I would like to reiterate that there is a difference between assigning probabilities and the project of logical probability. Probabilities have to follow the product rule, which is contains modus ponens in the case of certainty. If you assign probability 1 to the axioms, there is only one correct distribution for anything that follows from the axioms, like a digit of pi (or as the kids are calling it these days, $\pi$): probability 1 for the right answer, 0 for the wrong answers. If you want logical probabilities to deviate from this (like by being ignorant about some digit of pi), then logical probabilities cannot follow the same rules as probabilities.
Cool. If you use small words, then I will be happy :)
Or assigning degrees of belief to the axioms themselves! Any mutually-consistent set of statements will allow some probability assignment in which each statement has a nonzero degree of belief.
Which is actually an issue I did bring up, in proof-theoretic language. Let’s start by setting the term “certain distribution” to mean “1.0 to the right answer, 0.0 to everything else”, and then set “uncertain distribution” to be “all distributions wider than this”.
A non-normalized proof object (read: the result of a computation we haven’t done yet, a lazily evaluated piece of data) has an uncertain distribution over values. To head-normalize a proof object means to evaluate enough of it to decide (compute the identity of, with certainty) the outermost introduction rule (outermost data constructor), but (by default) this still leaves us uncertain of the premises which were given to that introduction rule (the parameters passed to the data constructor). As we proceed down the levels of the tree of deduction rules, head-normalizing as we go, we eventually arrive to a completely normalized proof-term/piece of data, for which no computation remains undone.
Only a completely normalized proof term has a certain distribution: 1.0 for the right answer, 0.0 for everything else. In all other cases, we have an uncertain distribution, albeit one in which all the probability mass may be allocated in the particular region associated with only one outermost introduction rule (the special case of head-normalized terms).
Remember, these distributions are being assigned over proof terms (ie: constructions in some kind of lambda calculus), not over syntactic values (well-typed and fully normalized terms). The distributions express our state of knowledge given the limited certainty granted by however many levels of the tree (all terms of inductive types can be written as trees) are already normalized to data constructors—they express uncertain states of knowledge in situations where the uncertainty derives from having a fraction of the computational power necessary to decide a decidable question.
So when we’re talking about “digits of $\pi$” sorts of problems, we should actually speak of distributions over computational terms, not “logical probabilities”. This problem has nothing to do with assigning probabilities to sentences in first-order classical logic: by the time we construct and reason with our distributions over computational terms, we have already established the existence of a proof object inhabiting a particular type. If that type happened to be a mere proposition, then we’ve already established, by the time we construct this distribution, that we believe in the truth of the proposition, and are merely reasoning under discrete uncertainty over how its truth was proof-theoretically asserted.
Whereas I think that establishing probabilities for sentences which are not decided by our given axioms may be more difficult, particularly when those sentences may be decidable by other means.
Thanks for explaining more!