To choose which bits to update on, we can mechanically search through all possible proofs in Peano arithmetic (PA) by increasing length. Whenever we prove a sentence with Gödel number N, we tell the prior that the Nth bit is 1; whenever we disprove one, we tell the prior that the bit is 0. After each update, the prior’s opinions about all other bits (and thus all other sentences in PA) will shift, hopefully in a meaningful way.
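A minimal sketch of the update loop I have in mind, with every component stubbed out as a toy (ToyPrior and enumerate_settled_sentences are hypothetical stand-ins, not real pieces of machinery):

```python
import itertools

def enumerate_settled_sentences():
    """Toy stand-in for a PA proof enumerator.

    A real version would search proofs by increasing length and yield
    (Goedel number, truth value) whenever a sentence gets proved or
    disproved.  The toy just pretends sentence n is settled at step n
    and is true iff n is even.
    """
    for n in itertools.count(1):
        yield n, (n % 2 == 0)

class ToyPrior:
    """Hypothetical interface to whatever sequence predictor we condition."""
    def __init__(self):
        self.known_bits = {}
    def update(self, index, bit):
        # Condition on "bit #index of the sequence equals `bit`".
        self.known_bits[index] = bit
    def predict(self, index):
        # A real universal-prior approximation would return a shifted
        # posterior probability here; the toy just says 1/2 when unsettled.
        return self.known_bits.get(index, 0.5)

prior = ToyPrior()
for n, truth in itertools.islice(enumerate_settled_sentences(), 100):
    prior.update(n, 1 if truth else 0)   # proof -> bit 1, disproof -> bit 0

print(prior.predict(8), prior.predict(1001))   # a settled bit vs. an open one
```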
I don’t think this will give us what we want. There’s a short program which searches through all possible proofs in PA up to some length, say 10^100, then starts printing out bits: 1 for the Nth bit if it found a proof of the sentence with Gödel number N, 0 if it found a disproof of that sentence, and a coin toss if it found neither (see the toy sketch at the end of this comment). This program would do perfectly on the bits you propose, and it’s hard to think of comparably short programs that would do just as well on those bits (and aren’t just variations of this with different proof length limits). So if you update the universal prior on this set of bits, you’ll end up with a posterior that consists almost entirely of this type of program/hypothesis, which seems useless both conceptually and for practical purposes.
You could instead try updating the universal prior with the probability estimates of a human mathematician on various mathematical statements, but then after a while you might just be picking out that particular human from the space of all programs. I don’t know what else can be done without running into problems like these.
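Schematically, the kind of program I mean looks like the toy below, with the PA proof search swapped for something trivially checkable; the only point is the shape: short code, one fixed astronomical search bound, and a coin toss for anything left unsettled.

```python
import random

def bounded_proof_search(n, limit):
    """Toy stand-in for 'search all PA proofs up to `limit` symbols'.

    Here 'sentence n' is just the claim 'n is a perfect square' and the
    'proof search' is trial squaring up to `limit`; the real program would
    use an actual proof checker and a bound like 10**100.
    """
    for k in range(limit):
        if k * k == n:
            return True       # "proof" found
        if k * k > n:
            return False      # "disproof" found
    return None               # bound exhausted, sentence left unsettled

def predicted_bit(n, limit=10**6):
    result = bounded_proof_search(n, limit)
    if result is None:
        return random.randint(0, 1)   # coin toss on unsettled sentences
    return 1 if result else 0

print([predicted_bit(n) for n in range(1, 20)])
```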
Sure, we probably end up with mostly slow programs. The interesting question to me is whether the approach works at all. If it does—if using sequence prediction for logical uncertainty is okay—then the next step is to try some other method of sequence prediction that rewards fast programs, like the speed prior. (Or it might happen anyway as we approximate the universal prior, though I’m not sure of that.) I wonder how well that would work, compared to the logical induction paper.
Using only speed to evaluate models lands you with a lookup table that stores the results you want. So you have to trade off speed and length: the speediest table of math results has length O(N) and runtime O(N) (maybe? Not at all sure), while the shortest general program has length O(log(N)) and uncomputably fast-growing runtime. So if we think of a separable cost function F(length) + G(time), then as long as F doesn’t grow super-fast and G doesn’t grow super-slow, the lookup table will eventually score better than brute-force search (a toy comparison is at the end of this comment).
Ideally you want to find some happy medium; this is reminding me of my old post on approximate induction.
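Here is the toy comparison mentioned above. All the growth rates are illustrative stand-ins (the brute-force prover’s true runtime isn’t even computably bounded), but they show how charging for time flips which hypothesis wins.

```python
import math

# Two hypotheses for the first n settled bits, scored by F(length) + G(time).
# The concrete numbers are made up for illustration only.

def lookup_table(n):
    # Stores the n bits outright: length ~ n bits, runtime ~ n steps.
    return {"length_bits": n, "time_steps": n}

def brute_force_prover(n):
    # Short general program: length ~ log2(n) plus a constant, but with a
    # runtime that explodes; 2**(n*n) is only a placeholder for "explodes".
    return {"length_bits": math.log2(n) + 50, "time_steps": 2 ** (n * n)}

def length_only_cost(h):              # F = identity, G = 0 (plain length prior)
    return h["length_bits"]

def length_plus_log_time_cost(h):     # F = identity, G = log2 (speed-prior-ish)
    return h["length_bits"] + math.log2(h["time_steps"])

for n in (4, 8, 16, 32, 64):
    t, p = lookup_table(n), brute_force_prover(n)
    print(n,
          round(length_only_cost(t)), round(length_only_cost(p)),
          round(length_plus_log_time_cost(t)), round(length_plus_log_time_cost(p)))

# Under length alone the short prover eventually wins (O(log n) beats O(n));
# once time is charged, even only logarithmically, the lookup table wins instead.
```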
You’re Manfred! Huh. I noticed you making an unusual amount of sense in some comment threads, but didn’t make the connection. Great to meet you again!
Using a combination of length and time sounds right. The speed prior uses length + log(time); I don’t know if it’s the best choice, but it seems okay? (A quick sketch of what that weighting amounts to is at the end of this comment.)
That said, I don’t quite understand your post. The universal prior is just a prior, not any particular approximation scheme. Sure, it’s equivalent to a mixture of all programs that predict bits exactly, but it’s also equivalent to a mixture of all programs that compute probability distributions over bits. (That fact is surprising and nontrivial, but true. It’s in the Li and Vitanyi book.) So you can approximate it on noisy data just fine.
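For concreteness, here is how I read “length + log(time)” as a weight on a program $p$ with description length $\ell(p)$ that takes $t_p$ steps to produce the predictions so far (just the informal reading; I haven’t checked how closely it matches the official definition):

$$ w(p) \;\propto\; 2^{-\left(\ell(p) + \log_2 t_p\right)} \;=\; \frac{2^{-\ell(p)}}{t_p}, $$

so doubling a program’s runtime costs it exactly one extra bit of description length.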
Nice to meet you again too :)
Yeah, I know that I’m somewhat sloppily identifying the priors with the programs that make up most of their weight. It makes it more convenient for me to think about what agents using those priors would do, though I am probably missing some details that would stem from using a computable approximation rather than the possibly uncomputable approach of just looking at the few most successful programs.
And it serves me right for not googling “speed prior.” I forgot it was length+log(time). I’m sure computational complexity people would have much more interesting things to say than me about why that might get you plenty of mathematics programs that are neither brute force nor lookup tables. Or maybe it’s just that taking the log of time to turn it into “bits” is the number one obvious thing to do.
If we think of the Solomonoff prior as the “real answer” and a speed-like prior as a convenience used for making computable predictions when computation has a cost, we could get some kind of principled answer from estimates of the benefit of predictive accuracy and the cost of time. I spent a few minutes and couldn’t figure it out—worst case reasoning about predictive accuracy gets stuck on some very rare worst cases, and I’m not sure how to do the average case reasoning right.