I am not convinced by the problematic example in the “Scientific Induction in Probabilistic Mathematics” writeup. Let’s say that there are n atoms ϕ(1)..ϕ(n). If you don’t condition, then because of symmetry, all consistent sets S drawn from the process have equal probability. So the prior on S is uniform, and the probability of ϕ(i) is therefore 1⁄2, by

P(ϕ(i)) = ∑_S 1[ϕ(i)∈S] · P(S).
The probability of finding ϕ(i) in a consistent set S drawn from the process is exactly 1⁄2 for all i; this must be true by symmetry, because μ(x) = μ(¬x). Now what you should do to condition on some statement X is simply throw out the sets S which don’t satisfy that statement, i.e.

P(S|X) = 1[S satisfies X] · P(S) / P(X).

Since the prior on S was uniform, it will still be uniform on the restricted set after conditioning. So

P(ϕ(i)|X) = ∑_S 1[ϕ(i)∈S] · P(S|X),

which should just be 90% in the example where X is “90% of the ϕ are true”.
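As a sanity check, here is a minimal sketch in Python (my own illustration, not code from the writeup) that conditions by rejection for a small n: enumerate all truth assignments, throw out the ones violating X, and average.

```python
from itertools import product

n = 10   # atoms phi(1)..phi(n); X says exactly 90% of them are true
k = 9

# All 2^n consistent sets S, each with uniform prior probability 1/2^n.
worlds = list(product([True, False], repeat=n))

# Condition on X by throwing out the sets that don't satisfy it.
kept = [w for w in worlds if sum(w) == k]

# P(phi(i) | X) for each atom: by symmetry, k/n = 0.9 for every i.
for i in range(n):
    p = sum(w[i] for w in kept) / len(kept)
    print(f"P(phi({i + 1}) | X) = {p:.2f}")
```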
The mistake in the writeup is to directly define P(S|X) in an inconsistent way.
To avoid drowning in notation, let’s consider a simpler example with the variables a, b and c. We will first pick a or ¬a uniformly, then b or ¬b, and finally c or ¬c. Then we try to condition on X = “exactly one of a, b, c is true”. You obviously get prior probabilities P(S) = 1⁄8 for all consistent sets.
If you condition the right way, you get P(S) = 1⁄3 for the sets with one true atom, and P(S) = 0 for the other sets. So then

P(a|X) = P(b|X) = P(c|X) = 1⁄3.
What the writeup does instead is first pick a or ¬a uniformly. If it picks a, we know that b and c are false. If we pick ¬a, we continue. The uniform choice of a is akin to saying that

P(S|X) = P(a) · P(b|a, X) · P(c|a, b, X).

But that first term should be P(a|X), not P(a)!
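To make the discrepancy concrete, here is a small sketch using exact arithmetic. The pick-order process below is my reading of the writeup’s construction (pick each variable uniformly among the values still extendable to a world satisfying X), so treat that part as an assumption:

```python
from fractions import Fraction
from itertools import product

def exactly_one(a, b, c):
    return a + b + c == 1          # X: exactly one of a, b, c is true

# Correct conditioning: uniform prior over all 8 worlds, reject the
# ones violating X, and read off P(a|X).
kept = [w for w in product([0, 1], repeat=3) if exactly_one(*w)]
print("rejection:", Fraction(sum(w[0] for w in kept), len(kept)))  # 1/3

# The pick-order process: choose a, then b, then c, each uniformly
# among the values still extendable to a world satisfying X.
def process_dist():
    dist = {}
    def extend(assign, prob):
        if len(assign) == 3:
            dist[tuple(assign)] = dist.get(tuple(assign), 0) + prob
            return
        opts = [v for v in (0, 1)
                if any(exactly_one(*assign, v, *rest)
                       for rest in product([0, 1], repeat=2 - len(assign)))]
        for v in opts:
            extend(assign + [v], prob / len(opts))
    extend([], Fraction(1))
    return dist

dist = process_dist()
print("process:  ", sum(p for w, p in dist.items() if w[0]))       # 1/2
```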
Sorry if there is something fishy in the writeup :(. I could believe it, given how rushed I was writing it.
Suppose we consider not just a, ¬a, b, ¬b, and c, ¬c, but also the statements q = “exactly one of a, b, c is true” and ¬q. Suppose now that we uniformly pick a truth value for a, then for q, then a logically consistent but otherwise random value for b, and finally a logically consistent but otherwise random value for c. Such an asymmetric situation could occur if b and c have high μ but a and q have small μ. In the worlds where we believe q, b and c are much more often disbelieved than a. I believe that basically captures the worries about Demski’s scheme that Paul was having; maybe he will comment himself.
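Here is a minimal sketch of that pick order, by exact enumeration (ignoring the μ-weighting, which only motivates why the order could arise): each choice is uniform among the values still extendable to a valuation consistent with q ⇔ “exactly one of a, b, c”. It gives P(a|q) = 1⁄2 but P(b|q) = P(c|q) = 1⁄4, where symmetric conditioning would give 1⁄3 each:

```python
from fractions import Fraction
from itertools import product

def consistent(a, q, b, c):
    return q == ((a + b + c) == 1)   # q <=> exactly one of a, b, c

# Pick a, then q, then b, then c -- each uniformly among the values
# that can still be extended to a consistent full valuation.
def pick_order_dist():
    dist = {}
    def extend(assign, prob):
        if len(assign) == 4:
            dist[tuple(assign)] = dist.get(tuple(assign), 0) + prob
            return
        opts = [v for v in (0, 1)
                if any(consistent(*assign, v, *rest)
                       for rest in product([0, 1], repeat=3 - len(assign)))]
        for v in opts:
            extend(assign + [v], prob / len(opts))
    extend([], Fraction(1))
    return dist

dist = pick_order_dist()
p_q = sum(p for w, p in dist.items() if w[1])            # P(q) = 1/2
for name, i in [("a", 0), ("b", 2), ("c", 3)]:
    prob = sum(p for w, p in dist.items() if w[1] and w[i]) / p_q
    print(f"P({name}|q) = {prob}")    # a: 1/2, b: 1/4, c: 1/4
```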
Does that clarify anything?
This was surprisingly clarifying for me.
I’m still not sure whether I find it too concerning. We assign higher probability to simpler theories, so long as they satisfy the constraint that 90% of the Q(n) with n below 10^100 are true. The theory which randomly assigns Q(n) to true/false for 10^99 simple values of n (and is then forced to assign Q(n) true for most of the remaining n below 10^100) is just one form of theory the process may generate, and not a really simple one. The theory that Q(n) means “not divisible by 10” is going to be much more probable than all of these theories.
In other words, the writeup estimates Q(n) based on a narrow subset of the theories which assign probability mass. I don’t really think that subset is representative…
Imagine that we encounter a truly iid random sequence of 90%-likely propositions Q(0), Q(1), Q(2), … Perhaps they are merely pseudorandom but impossibly complicated to reason about, or perhaps they represent some random external output that an agent observes. After observing a very large number of these Q(i), one might expect to place high probability on something like “About 90% of the next 10^100 Q(j) I haven’t observed yet will be true,” but there is unlikely to be any simple rule that describes the already-observed Q(i). Do you think that the next 10^100 Q(j) will each individually be believed 90% likely to be true, or will the simpler-to-describe Q(j) receive probabilities closer to 50%?
We can show that the FOL prior is not too different from the algorithmic prior, so it can’t perform too badly on problems where algorithmic induction does well. Partial theories which imply probabilities close to .9 but do not specify exact predictions will eventually have high probability. For example, a theory might say that Q(x) is the OR of two unspecified sources F(x) and G(x) (treated as random), giving a probability of roughly .75; variations of this would bring things closer to .9.
This may still assign the simpler Q(j) probabilities closer to 50%, though.
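For the arithmetic behind the OR example: if Q(x) is the OR of k unspecified sources, each treated as an independent fair random bit, then P(Q(x)) = 1 − 2^(−k), so two sources give .75 and adding sources moves this toward (and past) .9:

```python
# P(Q) when Q(x) = F1(x) OR ... OR Fk(x), each Fi an independent,
# unspecified source treated as a fair random bit.
for k in range(1, 5):
    print(f"{k} source(s): P(Q) = {1 - 0.5 ** k}")   # 0.5, 0.75, 0.875, 0.9375
```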
After writing this I realize that there is a much simpler prior on finite sets S of consistent statements: simply have a prior over all sets of statements, and keep only the consistent ones. If your language is chosen such that it contains X if and only if it also contains ¬X, then this is equivalent to choosing a truth value for each basic statement, and a uniform prior over these valuations would work fine.
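A minimal sketch of that construction; the tiny language here, with one derived statement q ⇔ (a ∧ b), is my own toy example:

```python
from fractions import Fraction
from itertools import product

# Valuations assign a truth value to each of a, b, q; the language
# contains x iff it contains not-x, so a set S is just a valuation.
valuations = list(product([0, 1], repeat=3))          # (a, b, q)

# Keep only the logically consistent ones: q must equal (a and b).
consistent = [v for v in valuations if v[2] == (v[0] and v[1])]

prior = Fraction(1, len(consistent))                  # uniform prior
for a, b, q in consistent:
    print(f"a={a} b={b} q={q}: P(S) = {prior}")       # 4 sets, 1/4 each
```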
The key here is that you are using finite S. What do you do if S is infinite? More concretely, is your scheme convergent as you grow the finite S by adding more and more statements? I believe we touch on such worries in the writeup.