Formalizing the richness of mathematics
Intuitively, there seems to be something special about mathematical knowledge from a learning-theoretic perspective. Mathematics seems infinitely rich: no matter how much we learn, there is always more interesting structure to be discovered. Impossibility results such as the halting problem and Gödel incompleteness lend some credence to this intuition, but are insufficient to fully formalize it.
Here is my proposal for how to formulate a theorem that would make this idea rigorous.
(Wrong) First Attempt
Fix some natural hypothesis class for mathematical knowledge, such as some variety of tree automata. Each such hypothesis Θ represents an infradistribution over Γ: the “space of counterpossible computational universes”. We can say that Θ is a “true hypothesis” when there is some θ in the credal set Θ (a distribution over Γ) s.t. the ground truth Υ∗∈Γ “looks” as if it were sampled from θ. The latter should be formalizable via something like a computationally bounded version of Martin-Löf randomness.
We can now try to say that Υ∗ is “rich” if for any true hypothesis Θ, there is a refinement Ξ⊆Θ which is also a true hypothesis and “knows” at least one bit of information that Θ doesn’t, in some sense. This is clearly true, since there can be no automaton, or even any computable hypothesis, which fully describes Υ∗. But it’s also completely boring: the required Ξ can be constructed by “hardcoding” an additional fact into Θ. This doesn’t look like “discovering interesting structure”, but rather just like brute-force memorization.
(Wrong) Second Attempt
What if instead we require that Ξ knows infinitely many bits of information that Θ doesn’t? This is already more interesting. Imagine that, instead of metacognition / mathematics, we were talking about ordinary sequence prediction. In that case it is indeed an interesting, non-trivial condition that the sequence contains infinitely many regularities, s.t. each of them can be expressed by a finite automaton but their conjunction cannot. For example, maybe the n-th bit in the sequence depends only on the largest k s.t. 2^k divides n, but the dependence on k is itself uncomputable (or at least inexpressible by a finite automaton).
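As a toy illustration of this sequence-prediction example, here is a Python sketch. The function f standing in for the dependence on k is a placeholder I chose so the code runs; in the intended construction it would be uncomputable (or at least not expressible by a finite automaton).

```python
def v2(n):
    """Largest k such that 2**k divides n (the 2-adic valuation), for n >= 1."""
    k = 0
    while n % 2 == 0:
        n //= 2
        k += 1
    return k

def f(k):
    """Placeholder for the dependence on k; in the real construction
    this would be uncomputable, here it is just k mod 2 so the sketch runs."""
    return k % 2

# The n-th bit of the sequence depends only on v2(n).
seq = [f(v2(n)) for n in range(1, 17)]
```

The point is that each single regularity “the bit is f(k) whenever v2(n) = k” is checkable by a finite automaton (it only has to track divisibility by 2^(k+1)), while the conjunction of all of them need not be expressible by any finite automaton.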
However, for our original application, this is entirely insufficient. This is because the formal language we use to define Γ (e.g. combinator calculus) has some “easy” equivalence relations. For example, consider the family of programs of the form “if 2+2=4 then output 0, otherwise...”. All of these programs output 0, which is obvious once you know that 2+2=4. Therefore, once your automaton is able to check some such easy equivalence relations, hardcoding a single new fact (in the example, 2+2=4) generates infinitely many “new” bits of information. Once again, we are left with brute-force memorization.
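A toy Python sketch of the blowup (the index i and the stand-in else-branch g(i) = i are my own illustrative choices, not part of the original construction):

```python
def program(i, fact_holds):
    """i-th member of the family 'if 2+2=4 then output 0, otherwise g(i)',
    with g(i) = i as an arbitrary stand-in for the else branch."""
    return 0 if fact_holds else i

# Hardcoding the single fact "2+2=4" decides the output of every program
# in the (infinite) family at once: one memorized bit plus an easy
# equivalence check yields infinitely many "new" bits.
fact = (2 + 2 == 4)
outputs = [program(i, fact) for i in range(5)]  # all zeros
```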
(Less Wrong) Third Attempt
Here’s the improved condition: For any true hypothesis Θ, there is a true refinement Ξ⊆Θ s.t. conditioning Θ on any finite set of observations cannot produce a refinement of Ξ.
There is a technicality here, because we’re talking about infradistributions, so what is “conditioning” exactly? For credal sets, I think it is sufficient to allow two types of “conditioning”:
1. For any given observation A and p∈(0,1], we can form {θ∈Θ∣θ(A)≥p}.
2. For any given observation A s.t. minθ∈Θθ(A)>0, we can form {(θ∣A)∣θ∈Θ}.
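For concreteness, here is a minimal Python sketch of these two operations on a finite credal set, represented as a list of distributions (dicts) over a finite outcome space. The representation is my own toy choice, purely for illustration:

```python
def restrict(credal, A, p):
    """Type-1 conditioning: keep only the distributions that assign
    the observation A probability at least p."""
    return [th for th in credal if sum(th[w] for w in A) >= p]

def condition(credal, A):
    """Type-2 conditioning: Bayes-condition every member on A.
    Requires min over theta in the credal set of theta(A) to be > 0."""
    assert all(sum(th[w] for w in A) > 0 for th in credal)
    out = []
    for th in credal:
        pA = sum(th[w] for w in A)
        out.append({w: (th[w] / pA if w in A else 0.0) for w in th})
    return out

# Example: a credal set with two distributions over outcomes {"a", "b"}.
credal = [{"a": 0.5, "b": 0.5}, {"a": 0.2, "b": 0.8}]

narrowed = restrict(credal, {"a"}, 0.4)   # keeps only the first distribution
conditioned = condition(credal, {"a"})    # every member collapses to a point mass on "a"
```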
This rules out the counterexample from before: the easy equivalence relation can be represented inside Θ, and then the entire sequence of “novel” bits can be generated by a single conditioning.
Alright, so does Υ∗ actually satisfy this condition? I think it’s very probable, but I haven’t proved it yet.