A basis for pattern-matching in logical uncertainty
Previous logical uncertainty robot designs (e.g. here, here, here) have relied on proving theorems that relate to the statement under inspection. For the statement “the trillionth digit of pi is 5,” a useful theorem would be something like “there are 10 possible, mutually exclusive digits [1,2,3,4,5,6,7,8,9,0] that the trillionth digit of pi could be.” This is nice. But it doesn’t do pattern-matching: if you tell a theorem-proving robot that a statement is true for the first thousand integers, it doesn’t even suspect that the statement might be true for the 1001st too. The statements are just independent black boxes.
In order to go further, our robot needs to peek inside the black box. But how, and why should it see patterns?
We tell our robot these facts: “3 is ‘odd’. 5 is ‘odd’. 7 is ‘odd’. 11 is ‘odd’.”
The robot asks “Could you just tell me what this concept ‘odd’ means explicitly? I don’t know English very well.”
“No,” we say. “You’ll have to guess.”
For robots that don’t make up information out of thin air, guesses about ‘odd’-ness will obey the maximum-entropy principle, which roughly means “give everything equal probability until you learn more.”
“Well, robot, what do you think is the probability that 9 is ‘odd’, given what we’ve told you?”
“50 percent. Beep boop.”
“But all those other things were odd, weren’t they?”
“Those other things were not the number 9. Imagine going number by number and labeling each number as ‘odd’ or ‘not odd’. There are equal numbers of possible labelings with 9 odd and 9 not odd, regardless of what those other numbers are. Since by the principle of maximum entropy I start out with all labelings equally likely, 9 has a 50% chance of being ‘odd’.”
“But isn’t it just sensible to think that the next number we give you will be ‘odd’ if all the others were? What of the appeal of simplicity?”
“What’s so great about simplicity? I’m trying not to make up information out of thin air here.”
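For concreteness, here is the robot’s counting argument as a few lines of code; restricting the universe to just these five numbers is an assumption made only to keep the sketch tiny.

```python
# Maximum-entropy robot: every odd/not-odd labeling of the numbers is equally
# likely a priori; condition on the stated facts and check how often 9 is 'odd'.
from itertools import product

numbers = [3, 5, 7, 9, 11]
told_odd = [3, 5, 7, 11]

consistent = [labeling for labeling in product([True, False], repeat=len(numbers))
              if all(labeling[numbers.index(n)] for n in told_odd)]

p_nine_is_odd = sum(labeling[numbers.index(9)] for labeling in consistent) / len(consistent)
print(p_nine_is_odd)  # 0.5 -- exactly half the consistent labelings call 9 'odd'
```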
What do we tell our robot to make it guess that 9 is probably odd? Well, we want it to think that simpler patterns of odd-ness are more likely. So how about we let the robot peek at our definition of ‘odd’, just long enough to see that any integer can be tested for ‘odd’-ness, and how complicated that test is.
Even if our robot has no clue what the constituent symbols were (or, more to the point, not enough processing power to fully explore the ramifications in a more difficult and realistic scenario), knowing the mere number of characters defining ‘odd’ puts an upper bound on the Kolmogorov complexity of how the numbers are labeled. Heck, even just learning that we can write down the oddness-test at all is huge, since there are infinitely many infinite-complexity rules.
Once the robot knows that the complexity of ‘odd’ is below a certain smallish value, low-complexity hypotheses like “all numbers are ‘odd’” and “odd numbers are ‘odd’” start to outweigh[1] more complicated hypotheses like “all numbers except {long list including 9} are ‘odd’.” These simple hypotheses contain just the sorts of patterns we sometimes want the robot to see, like ‘odd’-ness.
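To see the effect, here is a toy sketch with a made-up hypothesis space and made-up complexity numbers; it is not the calculation behind the probabilities quoted below, just an illustration that a complexity bound plus a 2^(-complexity) prior pushes belief toward the simple patterns.

```python
# Toy sketch: weight each candidate labeling rule by 2^(-complexity), keep only
# rules within the complexity bound that agree with the stated facts, and read
# off the probability that 9 gets labeled 'odd'. The hypotheses and their
# complexities are illustrative assumptions, not derived from anything.

def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

HYPOTHESES = [
    # (description, rule for whether n is labeled 'odd', complexity in symbols)
    ("all numbers are 'odd'", lambda n: True,        5),
    ("odd numbers are 'odd'", lambda n: n % 2 == 1,  7),
    ("primes are 'odd'",      lambda n: is_prime(n), 9),
    ("all but 9 are 'odd'",   lambda n: n != 9,     12),
]

def prob_odd(query, observed_odd, complexity_bound):
    """P(query is 'odd') under a 2^(-complexity) prior over consistent rules."""
    consistent = [(rule, 2.0 ** -k) for _, rule, k in HYPOTHESES
                  if k <= complexity_bound and all(rule(n) for n in observed_odd)]
    total = sum(w for _, w in consistent)
    return sum(w for rule, w in consistent if rule(query)) / total

print(prob_odd(9, [3, 5, 7, 11], complexity_bound=14))  # ~0.95: the simple rules dominate
```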
We tell our robot these facts: “3 is odd. 5 is odd. 7 is odd. 11 is odd. A number is ‘odd’ if a 14-character predicate is true.”
The robot asks “Could you just tell me what this concept ‘odd’ means explicitly?”
“No,” we say. “You’ll have to guess. What do you think is the probability that 9 is ‘odd’, given what we’ve told you?”
“65.1 percent.”
“Hah, got you! When we said ‘odd’ we secretly meant prime!”
I suspect that this kind of peeking handles most of the cases where we want something like pattern-matching (specifically, minimum-message-length prediction of patterns) in logical uncertainty. The obvious un-handled case, the choice of axioms or logical systems or that sort of thing, seems more like model uncertainty than logical uncertainty: the question is not really “is second-order logic true,” it’s “does second-order logic correspond to the world I want to model,” which is beyond the scope of this Discussion article.
[1] How do you turn the number of symbols we wrote it down with into a distribution over possible rules? It’s easiest to just maximum-entropy all valid rules with the correct number of symbols. Since many rules can be compressed, the set of rules with some number of symbols is smeared over lower possible Kolmogorov complexities, I think with an exponential preference towards lower complexity. But on the other hand, humans don’t hand out maximum-entropy samples of definitions—we’ll need probabilistic logic to do probabilistic logic. (Yo dawg, I heard you liked recursion, so we put this joke in itself so you can [this joke] while you [this joke].)
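To watch the smearing happen, here is a small experiment under an arbitrary toy rule language (fixed-length Python expressions over a six-character alphabet); the language is an assumption, but it shows how maximum entropy over rule-strings ends up favoring the labelings that many strings compute.

```python
import itertools
import collections

# Enumerate every 6-character string over a tiny alphabet, treat the valid ones
# as predicates of n, and count how many strings compute each labeling of 1..12.
ALPHABET = "n%12=<"
LENGTH = 6
DOMAIN = range(1, 13)

def labeling(expr):
    """Interpret expr as a predicate of n over DOMAIN; None if it isn't a valid rule."""
    try:
        return tuple(bool(eval(expr, {"n": n})) for n in DOMAIN)
    except Exception:
        return None

counts = collections.Counter()
for chars in itertools.product(ALPHABET, repeat=LENGTH):
    labels = labeling("".join(chars))
    if labels is not None:
        counts[labels] += 1

# Maximum entropy over valid rule-strings gives each labeling probability
# proportional to how many strings compute it, so compressible labelings
# (constants, n % 2 == 1, and so on) soak up most of the probability mass.
for labels, n_rules in counts.most_common(5):
    print(n_rules, labels)
```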
The computer said 65%. That’s 35% to cover stuff like ‘prime’, hardly an embarrassing failure.
Yeah the robot’s pretty great.
Consider a parallel case:
We tell our robot these facts: “3 is ‘odd’. 5 is ‘odd’. 7 is ‘odd’. 11 is ‘odd’.” … “Well, robot, what do you think is the probability that 10 is ‘odd’, given what we’ve told you?”
The simplest hypothesis is that “odd” is just a synonym for “a natural number”.
Yup, that is the simplest hypothesis. Interestingly, the next simplest by many lights are actually the exclusion of a single simple number. Then we get to alternating numbers, and soon after that we get enough space to explode the number of possibilities beyond a modern computer’s ability to keep track of the implications, necessitating us to call logical uncertainty methods inside our logical uncertainty methods (yo dawg, I heard you like recursion...).
Note that there are currently systems which can construct conjectures. See the discussion here. Also, PAC learning can do things similar to what you are talking about here.
Recognizing some common characteristics of objects to be placed in the not ‘odd’ bin would also lower the upper bound on the complexity.
Could you explain?
Like, ‘Oh, and also 4 is not odd, nor is 8, nor 10.’
If you want to rule out ‘prime’, you can add ‘2’ to that list.
Sure, will you take some Python code as an example? I had to replace spaces with periods; the verbatim formatting doesn’t seem to handle Python indented by 4 spaces.
Without taking negative training data into account:
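(A minimal sketch of the idea; the candidate properties below are hand-written stand-ins rather than an exhaustive lookup.)

```python
# Positive examples only: keep every candidate property that holds
# for all members of the training set.
all_properties = {
    "odd": lambda n: n % 2 == 1,
    "prime": lambda n: n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1)),
    "greater than 2": lambda n: n > 2,
    "less than 100": lambda n: n < 100,
}

positive_set = [3, 5, 7, 11]

potential_properties = [name for name, test in all_properties.items()
                        if all(test(n) for n in positive_set)]
print(potential_properties)  # all four candidates survive
```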
Taking negative training data into account, here we have a ‘positive set’, and a ‘negative set’:
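(The same sketch with the negative set added: a property survives only if it holds for every positive example and fails for every negative example.)

```python
# Positive and negative examples together.
all_properties = {
    "odd": lambda n: n % 2 == 1,
    "prime": lambda n: n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1)),
    "greater than 2": lambda n: n > 2,
    "less than 100": lambda n: n < 100,
}

positive_set = [3, 5, 7, 11]
negative_set = [4, 8, 10]

potential_properties = [name for name, test in all_properties.items()
                        if all(test(n) for n in positive_set)
                        and not any(test(n) for n in negative_set)]
print(potential_properties)  # only 'odd' and 'prime' survive now
```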
See the difference? In the second case, potential_properties is smaller. Note that this is not an optimal solution, since it looks up all possible properties in order to find the common properties of a training set; I wrote it that way because it’s a little more succinct than using intersections.
Abram Demski’s system does exactly this if you take his probability distribution and update on the statements “3 is odd”, “5 is odd”, etc. in a Bayesian manner. That’s because his distribution assigns a reasonable probability to statements like “odd numbers are odd”. Updating gives you reasonable updates on evidence.
His distribution also assigns a “reasonable probability” to statements like “the first 3^^^3 odd numbers are ‘odd’, then one isn’t, then they go back to being ‘odd’.” In the low computing power limit, these are assigned very similar probabilities. Thus, if the first 3^^^3 odd numbers are ‘odd’, it’s kind of a toss-up what the next one will be.
Do you disagree? If so, could you use math in explaining why?
What is “the low computing power limit”? If our theories behave badly when you don’t have computing power, that’s unsurprising. Do you mean “the large computing power limit”?
I think probability ( “the first 3^^^3 odd numbers are ‘odd’, then one isn’t, then they go back to being ‘odd’.” ) / probability (“all odd numbers are ‘odd’”) is approximately 2^-(length of 3^^^3) in Abram’s system, because the ratio of the probabilities of these statements appearing in the random process is supposed to be about that. I don’t see anything about the random process that would make the first one more likely to be contradicted before being stated than the second.
Nope. The key point is that as computing power becomes lower, Abram’s process allows more and more inconsistent models.
The probability of a statement appearing first in the model-generating process is not equal to the probability that it’s modeled by the end.
So does every process.
True. But for two very strong statements that contradict each other, there’s a close relationship.
Highly related: Bayesian Concept Learning (pdf).
Yeah, it’s also closely related to minimum-message-length prediction.
But this isn’t about learning from examples. Though man, the way I presented this article sure does seem like it’s about learning from examples. Whoops, bad pedagogy.
It’s about how, when we prove the number of smooth structures on R^2, we as humans automatically use that information to make predictions about the number of smooth structures on R^3. The question was: is there a basis for that which we can formalize?
This is just induction for sequence prediction. We have plenty of tools for this.
I’m kind of confused. Did we really mean odds or primes? If we told the robot that this statement was true for those N integers, shouldn’t we have said it correctly? If we did mean primes, then we could at least have been honest and said ‘2, 3, 5, 7’.
Oh, whoops. I’ll fix that to make it more ambiguous. :)
Anyhow, it doesn’t matter what we meant—what was communicated to the robot was “a property shared by 3, 5, 7 that can be tested for in 14 characters.” The robot doesn’t really understand English labels.