Using the universal prior for logical uncertainty (retracted)
(I’m not sure about any of this)
(Yup! Vanessa Kowalski pointed out a crucial error, see her comment below. The idea about the universal prior picking up all computable regularities is unproved. I’m leaving the post as it was, but please read it as a historical document only.)
Abram Demski had an old paper describing a naive way to use the universal prior for logical uncertainty. He pointed out some problems with it, but I’m not sure the problems are real. I think the approach will work surprisingly well.
The idea is that, since the universal prior is just a probability distribution over bit sequences, we don’t have to use it to predict bits one by one. We can also update it on facts like “the seventh bit is 1” without mentioning the previous six bits. To choose which bits to update on, we can mechanically search through all possible proofs in Peano arithmetic (PA) by increasing length. Whenever we prove a sentence with Gödel number N, we tell the prior that the Nth bit is 1; whenever we disprove one, we tell the prior that the bit is 0. After each update, the prior’s opinions about all other bits (and thus all other sentences in PA) will shift, hopefully in a meaningful way.
It’s easy to see that the probability of any provable sentence will converge to 1, because at some point it becomes 1 by fiat, and disprovable sentences will likewise go to 0. How fast is the convergence? I think it will be extremely fast, far outpacing our slow proof search and subsuming any possible human or computer reasoning about PA—even though the prior knows nothing about PA and only receives opaque bits.
To see why, recall that the universal prior can pick up any computable regularities. For example, if you feed it a sequence where all even-numbered bits are 1 and odd-numbered bits are hard to predict, it will quickly become very certain that all future even-numbered bits will be 1. Similarly, if there’s some decidable set of Gödel numbers that for some reason correspond to provable sentences (no matter how long their proofs are), the universal prior will become very certain of that after receiving enough data. The same goes for regularities in relationships between bits, like “for all A and B, if A is provable and A→B is provable, then B is provable as well”. All such shortcuts will be learned in full generality as quickly as possible.
This approach to logical uncertainty can work well for questions that can be settled by PA, like digits of π. What about sentences that aren’t provable or disprovable in PA, like Con(PA)? It’s tempting to try and make the process converge for all of those as well, but I don’t want to do that, because many such sentences are basically hallucinations. Who cares what their probabilities are.
Why does this approach matter? Isn’t the universal prior uncomputable? But it’s approximable, so any approximation scheme could be used in the same way as logical induction (another computable process with uncomputably slow convergence). Also the universal prior is well-studied and has many nice properties.
I don’t want to diminish the amazing work that the logical induction folks have done. In some ways (e.g. reflection) it works better than the universal prior. But it’s interesting to me that the idea of probabilistic reasoning about future theorems based on a small amount of known theorems can be captured so well by a standard approach, similar to how UDT can be almost completely covered by a one-liner combination of standard game theory tools.
Can you cite the results you are relying on? There is this paper from by 2011 by Lattimore, Hutter and Gavane which shows that (a particular version of) Solomonoff induction can predict deterministic regular subsequences. However, it doesn’t cover fully general relationships between bits. For example, if future bit i is predictably (according to some computable regularity) equal to future bit j, afaik it is not proved that Solomonoff induction will assign both bits the same probability (it is only proved that it will correctly predict the later of the two bits after seeing all bits that precede it). See also the “Open Questions” section in the paper (there is no known stochastic generalization or known convergence rate). In fact, the ability to deal with fully general patterns is, I think, the main advantage of logical induction (that, and the fact it combines this ability with applying computational resource bounds).
Well, you’re right. I’ll put a retraction in the post. Thank you!
By asking for citations you’re giving me too much credit—I’m playing the game a few levels below you. I was going off the “obvious” idea that if our prior gives nonzero probability to a hypothesis, and we feed the prior a sequence of bits that matches the hypothesis, then the posterior probability of the hypothesis should rise to 1. Took me some time to realize that’s not true at all!
Here’s a simple counterexample. Say our hypothesis is “all even bits are 0”, and our prior is an equal mixture of “all even bits are 0 and all odd bits are random” and “all odd bits are 0 and all even bits are random”. Note that our hypothesis starts out with probability 1⁄2 according to the prior. But if we feed the prior a sequence of all 0′s, which matches the hypothesis, the posterior probability of the hypothesis won’t go to 1. It will keep oscillating forever. The prior just happens to be too good at guessing this particular sequence without help from our hypothesis.
I’d be surprised if something like that happened with computable hypotheses and the universal prior, but I don’t have a proof and couldn’t find one in a few hours. So thanks again.
Further, If Solomonoff Induction does get these problems right, it does so because closure properties on the class of hypotheses, not because of properties of the way in which the hypotheses are combined.
In the Logical Induction framework, if you add a bunch of other uncomputable hypotheses, you will still get the good properties on the predictable sub-patterns of the environment.
If you start with the Solomonoff Induction framework, this is demonstrably not true: If I have an environment which is 1 on even bits and uncomputable on odd bits, I can add an uncomputable hypothesis that knows all the odd bits. It can gain trust over time, then spend that trust to say that the next even bit has a 90% chance to be a 0. It will take a hit from this prediction, but can earn back the trust from the odd bits and repeat.
Isn’t that a complaint against Bayes, not just Solomonoff? Take any prior P. Take a sequence S whose even bits are all 1 and whose odd bits maximally disagree with P. Take another prior H that exactly knows the odd bits of S but thinks the even bits are random. Now the prior (P+H)/2 will never learn that all even bits of S are 1.
So I’m not quite ready to buy that complaint, because it seems to me that any good method should be at least approximately Bayesian. But maybe I don’t understand enough about what you’re doing...
It is a complaint against Bayes, but it is only a complaint against using Bayes in cases where the real world is probability 0 in your prior.
Part of the point of logical induction is that logic is complicated and no hypothesis in the logical induction algorithm can actually predict it correctly in full, but the algorithm allows for the hypotheses to prove themselves on a sub-pattern, and have the ensemble converge to the correct behavior on that sub-pattern.
Is there a simple “continuous” description of the class of objects that LI belongs to, which shows the point of departure from Bayes without relying on all details of LI? (For example, “it’s like a prior but the result also depends on ordering of input facts”.)
Not really.
You can generalize LI to arbitrary collections of hypotheses, and interpret it as being about bit sequences rather logic, but not much more than that.
The reason the LI paper talks about the LI criterion rather than a specific algorithm is to push in that direction, but it is not as clean as your example.
I’m not sure I understand the question correctly, but what “LI” actually depends on is, more or less, a collection of traders plus a “prior” over them (although you can’t interpret it as an actual prior since more than one trader can be important in understanding a given environment). Plus there is some ambiguity in the process of choosing fixed points (because there might be multiple fixed points).
I don’t think this will give us what we want. There’s a short program which searches through all possible proofs in PA up to length say 10^100, then starts printing out bits, 1 for the Nth bit if it found a proof for sentence with Gödel number N, 0 if it found a disproof for that sentence, and a coin toss if it didn’t find either a proof or disproof. This program would do perfectly on the bits that you propose, and it’s hard to think of comparably short programs that would do just as well on those bits (and aren’t just variations of this with different proof length limits), so if you update the universal prior on this set of bits you’ll end up with a posterior that consists almost entirely of this type of program/hypothesis, which seems useless both conceptually and for practical purposes.
You could instead try updating the universal prior with the probability estimates of a human mathematician on various mathematical statements, but then after a while you might just be picking out that particular human from the space of all programs. I don’t know what else that can be done without run into problems like these.
Sure, we probably end up with mostly slow programs. The interesting question to me is whether the approach works at all. If it does—if using sequence prediction for logical uncertainty is okay—then the next step is to try some other method of sequence prediction that rewards fast programs, like the speed prior. (Or it might happen anyway as we approximate the universal prior, though I’m not sure of that.) I wonder how well that would work, compared to the logical induction paper.
Using only speed to evaluate models lands you with a lookup table that stores the results you want. So you have to trade off speed and length: The speediest table of math results has length O(N) and speed O(N) (maybe? Not at all sure). The shortest general program has length O(log(N)) and uncomputably fast-growing time. So if we think of a separable cost function F(length)+G(time), as long as F doesn’t grow super-fast nor G super-slow, eventually the lookup table will have better score than brute-force search.
Ideally you want to find some happy medium—this is reminding my of my old post on approximate induction.
You’re Manfred! Huh. I noticed you making an unusual amount of sense in some comment threads, but didn’t make the connection. Great to meet you again!
Using a combination of length and time sounds right. The speed prior uses length+log(time), I don’t know if it’s the best choice but it seems okay?
That said, I don’t quite understand your post. The universal prior is just a prior, not any particular approximation scheme. Sure, it’s equivalent to a mixture of all programs that predict bits exactly, but it’s also equivalent to a mixture of all programs that compute probability distributions over bits. (That fact is surprising and nontrivial, but true. It’s in the Li and Vitanyi book.) So you can approximate it on noisy data just fine.
Nice to meet you again too :)
Yeah, I know that I’m somewhat sloppily identifying the priors with the programs that make up most of their weight. It makes it more convenient for me to think about what agents using those priors would do—though I am probably missing some details that would stem from using a computable approximation rather than the possibly uncomputable one of looking at the most successful few programs.
And it serves me right for not googling “speed prior.” I forgot it was length+log(time). I’m sure computational complexity people would have much more interesting things to say than me about why that might get you plenty of mathematics programs that are neither brute force nor lookup tables. Or maybe it’s just that taking the log of time to turn it into “bits” is the number one obvious thing to do.
If we think of the Solomonoff prior as the “real answer” and a speed-like prior as a convenience used for making computable predictions when computation has a cost, we could get some kind of principled answer from estimates of the benefit of predictive accuracy and the cost of time. I spent a few minutes and couldn’t figure it out—worst case reasoning about predictive accuracy gets stuck on some very rare worst cases, and I’m not sure how to do the average case reasoning right.