To see why, recall that the universal prior can pick up any computable regularities. For example, if you feed it a sequence where all even-numbered bits are 1 and odd-numbered bits are hard to predict, it will quickly become very certain that all future even-numbered bits will be 1. Similarly, if there’s some decidable set of Gödel numbers that for some reason correspond to provable sentences (no matter how long their proofs are), the universal prior will become very certain of that after receiving enough data. The same goes for regularities in relationships between bits, like “for all A and B, if A is provable and A→B is provable, then B is provable as well”. All such shortcuts will be learned in full generality as quickly as possible.
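As a toy illustration of the even-bits case (a two-hypothesis Bayesian mixture, not the universal prior itself; the hypotheses and numbers below are made up for the example), the following Python sketch shows the posterior concentrating on the hypothesis that has the regularity built in:

```python
# Toy Bayesian mixture over two hand-picked predictors (a stand-in for the
# universal prior, which can't actually be run).  Hypothesis A: even-indexed
# bits are 1, odd-indexed bits are fair coin flips.  Hypothesis B: every bit
# is a fair coin flip.
import random

random.seed(0)

def h_even_ones(i):
    return 1.0 if i % 2 == 0 else 0.5  # P(bit_i = 1) under hypothesis A

def h_uniform(i):
    return 0.5                         # P(bit_i = 1) under hypothesis B

hypotheses = [h_even_ones, h_uniform]
weights = [0.5, 0.5]  # prior weights

for i in range(40):
    # The environment: even bits are 1, odd bits are genuinely random.
    bit = 1 if i % 2 == 0 else random.randint(0, 1)
    # Bayes update: multiply each weight by the probability it gave the observed bit.
    weights = [w * (h(i) if bit == 1 else 1.0 - h(i)) for w, h in zip(weights, hypotheses)]
    total = sum(weights)
    weights = [w / total for w in weights]

print(f"posterior on hypothesis A: {weights[0]:.6f}")
print(f"mixture's P(next even bit = 1): {weights[0] * 1.0 + weights[1] * 0.5:.6f}")
```

After 40 bits the likelihood ratio in favor of hypothesis A is 2^20 (it gains a factor of 2 on every even bit and ties on every odd bit), so the mixture is already all but certain that future even bits will be 1.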
Can you cite the results you are relying on? There is this paper from 2011 by Lattimore, Hutter and Gavane which shows that (a particular version of) Solomonoff induction can predict deterministic regular subsequences. However, it doesn’t cover fully general relationships between bits. For example, if future bit i is predictably (according to some computable regularity) equal to future bit j, as far as I know it is not proved that Solomonoff induction will assign both bits the same probability (it is only proved that it will correctly predict the later of the two bits after seeing all bits that precede it). See also the “Open Questions” section in the paper (there is no known stochastic generalization or known convergence rate). In fact, the ability to deal with fully general patterns is, I think, the main advantage of logical induction (that, and the fact that it combines this ability with computational resource bounds).
Well, you’re right. I’ll put a retraction in the post. Thank you!
By asking for citations you’re giving me too much credit—I’m playing the game a few levels below you. I was going off the “obvious” idea that if our prior gives nonzero probability to a hypothesis, and we feed the prior a sequence of bits that matches the hypothesis, then the posterior probability of the hypothesis should rise to 1. Took me some time to realize that’s not true at all!
Here’s a simple counterexample. Say our hypothesis is “all even bits are 0”, and our prior is an equal mixture of “all even bits are 0 and all odd bits are random” and “all odd bits are 0 and all even bits are random”. Note that our hypothesis starts out with probability 1⁄2 according to the prior. But if we feed the prior a sequence of all 0’s, which matches the hypothesis, the posterior probability of the hypothesis won’t go to 1. It will keep oscillating forever. The prior just happens to be too good at guessing this particular sequence without help from our hypothesis.
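For concreteness, here is that arithmetic as a short Python sketch (0-indexed bits, so “even” includes bit 0; the names are mine). After any even number of observed bits the two components have assigned the all-zeros prefix exactly the same probability, and after an odd number they differ by a single factor of 2, so the posterior of “all even bits are 0” just bounces between 1⁄2 and 2⁄3:

```python
# Two-component mixture prior updated on the all-zeros sequence.
# Component A: even bits are 0 with certainty, odd bits are fair coin flips.
# Component B: odd bits are 0 with certainty, even bits are fair coin flips.
# The posterior weight of A equals the posterior of "all even bits are 0",
# since that event has probability 0 under component B.
log2_like_a = 0.0  # log2-likelihood of the observed prefix under A
log2_like_b = 0.0  # ... and under B

for i in range(12):
    # The observed bit is always 0: A gives it probability 1 on even positions
    # and 1/2 on odd positions; B gives it 1/2 on even positions and 1 on odd.
    log2_like_a += 0.0 if i % 2 == 0 else -1.0
    log2_like_b += -1.0 if i % 2 == 0 else 0.0
    post_a = 1.0 / (1.0 + 2.0 ** (log2_like_b - log2_like_a))
    print(f"after bit {i}: P(all even bits are 0) = {post_a:.3f}")
```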
I’d be surprised if something like that happened with computable hypotheses and the universal prior, but I don’t have a proof and couldn’t find one in a few hours. So thanks again.
Further, if Solomonoff Induction does get these problems right, it does so because of closure properties on the class of hypotheses, not because of properties of the way in which the hypotheses are combined.
In the Logical Induction framework, if you add a bunch of other uncomputable hypotheses, you will still get the good properties on the predictable sub-patterns of the environment.
If you start with the Solomonoff Induction framework, this is demonstrably not true: If I have an environment which is 1 on even bits and uncomputable on odd bits, I can add an uncomputable hypothesis that knows all the odd bits. It can gain trust over time, then spend that trust to say that the next even bit has a 90% chance to be a 0. It will take a hit from this prediction, but can earn back the trust from the odd bits and repeat.
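Here is a minimal sketch of that mechanism (my own toy version: a two-component Bayesian mixture stands in for the Solomonoff mixture plus the added hypothesis, coin flips stand in for the uncomputable odd bits, and the 0.9 prediction and trust threshold are arbitrary choices):

```python
# "Honest" predictor: knows even bits are 1, is 50/50 on the odd bits.
# "Adversary": knows every odd bit exactly; whenever its posterior weight
# climbs above a threshold, it spends that trust by predicting the next
# even bit is 0 with probability 0.9.
import random

random.seed(0)
w_honest, w_adv = 0.5, 0.5  # posterior weights of the two predictors

for i in range(60):
    true_bit = 1 if i % 2 == 0 else random.randint(0, 1)  # the environment

    if i % 2 == 0:
        p_honest = 1.0                       # P(bit = 1): even bits are always 1
        p_adv = 0.1 if w_adv > 0.9 else 1.0  # spend accumulated trust occasionally
        p_mix = w_honest * p_honest + w_adv * p_adv
        if p_mix < 0.5:
            print(f"bit {i}: mixture says P(1) = {p_mix:.2f}, "
                  f"though every even bit so far has been 1")
    else:
        p_honest = 0.5                       # honest predictor can't beat 50/50 here
        p_adv = float(true_bit)              # adversary knows the odd bits exactly

    # Bayes update: weight each predictor by the probability it gave the outcome.
    like_honest = p_honest if true_bit == 1 else 1.0 - p_honest
    like_adv = p_adv if true_bit == 1 else 1.0 - p_adv
    w_honest, w_adv = w_honest * like_honest, w_adv * like_adv
    total = w_honest + w_adv
    w_honest, w_adv = w_honest / total, w_adv / total
```

The adversary gains a factor of 2 on every odd bit and only loses a factor of 10 on the even bits where it spends, so it keeps recovering, and the mixture’s prediction for an even bit drops below 1⁄2 again and again even though every even bit so far has been 1.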
Isn’t that a complaint against Bayes, not just Solomonoff? Take any prior P. Take a sequence S whose even bits are all 1 and whose odd bits maximally disagree with P. Take another prior H that exactly knows the odd bits of S but thinks the even bits are random. Now the prior (P+H)/2 will never learn that all even bits of S are 1.
So I’m not quite ready to buy that complaint, because it seems to me that any good method should be at least approximately Bayesian. But maybe I don’t understand enough about what you’re doing...
It is a complaint against Bayes, but it is only a complaint against using Bayes in cases where the real world is probability 0 in your prior.
Part of the point of logical induction is that logic is complicated and no hypothesis in the logical induction algorithm can actually predict it correctly in full, but the algorithm allows hypotheses to prove themselves on a sub-pattern and lets the ensemble converge to the correct behavior on that sub-pattern.
Is there a simple “continuous” description of the class of objects that LI belongs to, which shows the point of departure from Bayes without relying on all details of LI? (For example, “it’s like a prior but the result also depends on ordering of input facts”.)
Not really.
You can generalize LI to arbitrary collections of hypotheses, and interpret it as being about bit sequences rather than logic, but not much more than that.
The reason the LI paper talks about the LI criterion rather than a specific algorithm is to push in that direction, but it is not as clean as your example.
I’m not sure I understand the question correctly, but what “LI” actually depends on is, more or less, a collection of traders plus a “prior” over them (although you can’t interpret it as an actual prior since more than one trader can be important in understanding a given environment). Plus there is some ambiguity in the process of choosing fixed points (because there might be multiple fixed points).