suppose that you find evidence that Q(x) is true of exactly 90% of all numbers between x=1 and x=10^100. Then we would expect that if you plug in some arbitrary number in this range, say n = floor(7^7^7^7 * pi) mod 10^100, then the posterior probability of Q(n), after conditioning on your evidence, would be 0.9. But it turns out that in Demski’s proposal, the probability of Q(n) would be approximately 0.5.
This is because the task you’re talking about is one of generalizing from a pattern rather than from proofs, which early proposals for assigning logical probabilities didn’t handle. I have literally written an article about this; check it out.
I’ll plug my own posts as many times as it takes :P
Hi Manfred, your post indeed touches on many of the same issues we were considering at the workshop. If I understand it correctly, you are worried about the robot correctly learning universal generalizations. For example, if P(x) is true for all odd x and false for all even x, you want your robot to correctly guess that rule after observing the values of P(0), P(1), P(2), ..., P(1 million). In fact, most current proposals, including Demski’s, are capable of learning these universal generalizations. Our current programs are able to correctly learn rules like “P(x) is true for all odd x,” or even “P(x) is true for 90% of x between 0 and 10 million.” You could call this step “generalization.” There are many interesting things to say about generalization, including your post and Will Sawin’s Pi1-Pi2 Theorem.

The problem Paul brought to our attention at this workshop, however, was not generalization but “specialization.” If you believe that “P(x) is true for all odd x,” it’s rather easy to say what you believe about P(994543). But it’s surprisingly subtle to understand what you should believe about P(994543) given only that “P(x) is true for 90% of x between 0 and 10 million,” particularly when P(994543) is part of a different sequence of propositions Q(x) about which you have different statistical beliefs. Properly handling the “specialization” step was the focus of the workshop, and I’d be interested to hear if you have any thoughts about the subtleties brought up in the write-up.
Well, I wrote down some stream of consciousness stuff :P
The statement “P(x) is true for 90% of x in some domain” is an interesting one. A message-length or complexity-based prior over patterns (like the one I justified in my post, and as is commonly used in things like minimum message length prediction) does not have statements like that as basic statements. Instead, the basic statements are about the different patterns that could correspond to the values of P(x) over its domain.
If you’ve done random sampling over the domain of x and gotten P(x) true 90% of the time, then a robot using a complexity-based prior refuses to throw away the information about which x’s you sampled, and just tries to find simple patterns in your data. This does lead it to predict that additional statements have a high probability of being true, because simple patterns tend to be consistent. But it does not assign equal probability to the truth of each P(x), because different x’s can fit into different patterns.
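To make that concrete, here’s a minimal Python sketch of the kind of thing I mean. The candidate patterns, their description lengths in bits, and the sampled values are all invented for illustration; this is a toy, not anyone’s actual proposal.

```python
from fractions import Fraction

# Toy hypothesis space: a few candidate patterns for P on the domain 0..99,
# each with a made-up description length in bits.
patterns = [
    ("always true",                        2,  lambda x: True),
    ("true for odd x",                     4,  lambda x: x % 2 == 1),
    ("true unless x ends in 7",            6,  lambda x: x % 10 != 7),
    ("true unless x ends in 7 or x == 50", 10, lambda x: x % 10 != 7 and x != 50),
]

# Complexity prior: weight each pattern by 2^(-description length).
prior = {name: Fraction(1, 2 ** bits) for name, bits, _ in patterns}

# Invented raw samples: which x's we checked and what we saw.
observations = {3: True, 27: False, 22: True, 41: True, 88: True}

# Update on the raw data: a pattern keeps its weight only if it reproduces
# every sampled value; otherwise it is zeroed out.
weights = {name: (prior[name] if all(f(x) == v for x, v in observations.items())
                  else Fraction(0))
           for name, _, f in patterns}
z = sum(weights.values())
posterior = {name: w / z for name, w in weights.items()}

# Predictive probability that P(x) is true: mix the surviving patterns.
def prob_true(x):
    return sum(posterior[name] for name, _, f in patterns if f(x))

print(prob_true(58))  # 1: every surviving pattern says true
print(prob_true(50))  # 16/17: the two survivors disagree about 50
print(prob_true(57))  # 0: every surviving pattern says false
```

The surviving patterns form a mixture, so 58, 50, and 57 get three different probabilities even though none of them were sampled.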
So a complexity prior over functions exhibits the correct behavior, at least when you give it the raw data. But what if we give it only the information “P(x) is true for 90% of x in some domain”?
Well, there are lots of different patterns with P(x) true for 90% of x. But the math is fairly straightforward: our robot takes its prior, cuts out every pattern that doesn’t have P(x) true for 90% of x, and then renormalizes to get its new distribution over patterns. Again, not every x has the same probability, because simple patterns contribute more, but the probability is around 90% for various x.
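Here’s the same kind of toy calculation, but conditioning on the summary statement rather than on raw samples; again the patterns and bit counts are made up.

```python
from fractions import Fraction

N = 100  # toy domain 0..99

patterns = [
    ("always true",             2, lambda x: True),          # 100% true
    ("true for odd x",          4, lambda x: x % 2 == 1),    # 50% true
    ("true for x >= 10",        5, lambda x: x >= 10),       # 90% true
    ("true unless x ends in 3", 6, lambda x: x % 10 != 3),   # 90% true
    ("true unless x ends in 7", 6, lambda x: x % 10 != 7),   # 90% true
]

prior = {name: Fraction(1, 2 ** bits) for name, bits, _ in patterns}

# Condition on "P(x) is true for 90% of the domain": cut every pattern that
# doesn't satisfy it, then renormalize the rest.
kept = {name: prior[name] for name, _, f in patterns
        if sum(f(x) for x in range(N)) == 90}
z = sum(kept.values())
posterior = {name: w / z for name, w in kept.items()}

def prob_true(x):
    return sum(posterior.get(name, Fraction(0)) for name, _, f in patterns if f(x))

# Different x's get different numbers, but the average over the domain is 0.9.
for x in (5, 13, 57, 58):
    print(x, prob_true(x))
print(sum(prob_true(x) for x in range(N)) / N)  # 9/10
```

With only five candidate patterns the individual numbers are lumpy, but the two features I care about survive: the domain-wide average is 0.9, and different x’s get different probabilities depending on which surviving patterns they fit.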
However, at no point does our complexity-prior robot ever think of the statement “P(x) is true for 90% of x in some domain” on its own. To get that, we’d have to build a robot that approximates by throwing away the information about which sampled P(x) were true, keeping only the frequencies, and that then ignores any information about the complexity of the function (equivalent to a uniform non-standard prior), so that it doesn’t update to a more complicated distribution.
This is a reasonable path to take for approximation. After all, the complexity prior isn’t going to be computationally tractable for many real-world problems. Throwing out the sampled P(x) is obviously bad if you have enough computational resources to use that information (e.g. imagine if 50% were true and the true ones were exactly the odd numbers), so along with approximation schemes, we should remember to think about ways of controlling, and doing cost/benefit analyses on, those approximations.
Hi Manfred,
It might help us communicate if you read the report, though I’m not sure it’s understandable, since it was written quite rapidly :(. Thanks for sending your stream of consciousness, and I look forward to hearing more of it. You seem quite convinced that a simplicity prior exhibits reasonable behavior, but I am less so. A lot of the workshop involved understanding potential pitfalls of simplicity priors. Basically, some pathologies appear to arise from the fact that a very large pseudorandom number with a short description is treated very differently from a truly random large number with no short description.
It is difficult to analyze simplicity priors since they are so intractably complicated, but do you believe that, after updating on the first 1 million values of a random, 90%-true sequence of propositions P(0), P(1), ..., P(1 million), P(x) will be assigned similar truth probabilities when x admits a short description and when x admits only a complicated description?
I confess I’m ignorant about what report you mean—could you throw me a link?
Hm, x being simple or complicated...
For example, if our predicate P(x) is true just for prime numbers, then “the trillionth prime number” is a simple description—no prime number can have a greater complexity than its description qua prime number, and so if we assign high probability to simple descriptions like primeness, we’ll expect P to be true for simple numbers.
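One way to state the bound I’m leaning on, in Kolmogorov-complexity notation (this is my formalization of the informal claim, with $K$ denoting description complexity and $p_n$ the $n$-th prime):

$$K(p_n) \le K(n) + O(1),$$

since a fixed program can compute the $n$-th prime given a description of $n$. So “the trillionth prime” costs little more to describe than “a trillion” does.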
I would argue that this is proper behavior. It is not that our robot cares about the complexity of the number. It just tries to find simple patterns in the data, and naturally some numbers will fit into more of the simple patterns than others, once we’re given information about the density of P(x) being true.
An extreme example would be if P(x) were true (or, equivalently, false) for only a single number. Then our probability for P(x) really would be about the complexity of the number. Why is this proper behavior? Because information that justifies pattern-matching behavior is information about the complexity of P(x)! If P(x) is true for just one number and we know that that number’s description is less than 100 bits, that knowledge is what allows us to exhibit a preference for simpler patterns, and thus for simpler numbers.
Fun fact: if we know that P(x) is true for 50% of numbers, the complexity of x has essentially no impact, even though we still favor simple patterns.
Hi Manfred, the link is in the theme 1 section of the original post. Let me know if it clarifies our worries about simplicity priors.

It looks like what I was talking about and what you were talking about were different things. I was talking about having a complexity-based prior over the behavior of our predicate P(x) on the basis of actually knowing about the complexity of P.
The bad behavior you wrote up, though, seems like a consequence of using a shoddy method for updating on outside information.
If we just want to throw computing power at the problem until it goes away, the solution is actually really simple: you pick a model at random like normal, but if it’s not consistent with the information we’re conditioning on, you throw it out. Note that this requires even probabilistic-seeming information to be phrased as frequencies, like “P(x) is true for 90% of some domain.” But since this whole method of assigning probabilities is built off of frequencies, you’re probably fine with that.
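A crude sketch of that brute-force recipe, in Python. Everything here is a stand-in of mine: the “models” are just uniformly random truth assignments to P(0) through P(9), and the conditioning information is that exactly 90% of them are true.

```python
import random

N = 10        # toy domain: P(0), ..., P(9)
TARGET = 9    # conditioning information: exactly 90% of the P(x) are true

def sample_model():
    # "Pick a model at random like normal": here, an arbitrary truth value
    # for each P(x).
    return [random.random() < 0.5 for _ in range(N)]

def sample_conditioned_model():
    # If the model isn't consistent with the information we're conditioning
    # on, throw it out and try again.
    while True:
        model = sample_model()
        if sum(model) == TARGET:
            return model

# Estimate the probability the conditioned process assigns to P(7).
trials = 2000
hits = sum(sample_conditioned_model()[7] for _ in range(trials))
print(hits / trials)  # about 0.9, by symmetry across the ten positions
```

It wastes a lot of samples, which is the “throw computing power at it” part, but the accepted models make P(7) true about 90% of the time.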
Can I ask a non-sequitur question? If I were to write up some posts about logical probability, largely reprising the analogous trains of thought in the first few chapters of Jaynes, what would be a good comprehensive strategy to get the people attending workshops like this to read them?
you pick a model at random like normal, but if it’s not consistent with the information we’re conditioning on, you throw it out.
You can do this for a finite set S by picking a random assignment of truth values to sentences in S, but (1) there is a nasty dependence on S, so the resulting distribution is very sensitive to the choice of S and definitely doesn’t converge to any limit (this is related to the difficulty of defining a “random model” of some axioms), and (2) since S is finite, you actually need to have some sentence like “P(x) is true for 90% of x such that ‘P(x)’ is in S,” which is really awkward, and I don’t know how to make it actually work.
A technically correct solution along these lines would still be of interest, but my intuition is that this isn’t working for the right reasons and so there are likely to be further problems even if (2) is resolved and (1) is deemed non-problematic.
Also note that “pick a random model consistent with information we are conditioning on” is not the kind of solution we are looking for; we want conditioning to be conditioning. So your proposal would correspond to picking sentences at random for a while as in Abram Demski’s proposal, and then transitioning into choosing a model at random at some (probably randomly chosen) point. This is something we considered explicitly.
Anyhow, the question is, why is throwing out non-90% models and trying again going to make [the probability that we assign true to P(x) using this random process] behave like we want the probability that P(x) is true to behave?
We can answer this with an analogy to updating on new information. If we have a probability distribution over models, and we learn that the correct model says that 90% of the P(x) are true in some domain, what we do is zero out the probability of all models where that’s false, and normalize the remaining probabilities to get our new distribution. All this “output of the random process” stuff is really just describing a process that has some probability of outputting different models (that is, Abram’s process outputs a model drawn from some distribution, and then we take the probability that P(x) is true to be the probability that the output of Abram’s process assigns true to P(x)).
So the way you do updating is you zero out the probability that this process outputs a model where the conditioned-upon information is false, and then you normalize the outputs so that the process outputs one of the remaining models with the same relative frequencies. This is the same behavior as updating a probability distribution.
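A tiny numerical check of that claim, with model probabilities I made up:

```python
from fractions import Fraction

# Three toy models: prior probability of being output by the process, the
# truth value the model assigns to P(7), and whether the statement we learn
# ("90% of the P(x) are true") holds in that model.
models = [
    (Fraction(1, 2), True,  True),
    (Fraction(1, 3), False, True),
    (Fraction(1, 6), True,  False),
]

# Conditioning: zero out models where the learned statement is false,
# then renormalize the rest.
kept = [(p, p7) for p, p7, ok in models if ok]
z = sum(p for p, _ in kept)
print(sum(p for p, p7 in kept if p7) / z)  # 3/5

# The rejection description gives the same number: among accepted runs of the
# process (total probability 1/2 + 1/3), the fraction that make P(7) true is
# (1/2) / (1/2 + 1/3) = 3/5.
```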
--
One thing I think you might mean by “we want conditioning to be conditioning” is that you don’t want to store a (literal or effective) distribution over models, and then condition by updating that distribution and recalculating the probability of a statement. You want to store the probability of statements, and condition by doing something to that probability. Like, P(A|B) = P(AB)/P(B).
I like the aesthetics of that too—my first suggestion for logical probability was based off of storing probabilities of statements, after all. But to make things behave at all correctly, you need more than just that, you also need to be able to talk about correlations between probabilities. The easiest way to represent that? Truth tables.
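For example, here’s a minimal sketch of what I mean by using truth tables to carry the correlations (the weights are invented for illustration):

```python
from fractions import Fraction

# One weight per row of the truth table over two statements (A, B). Storing
# the joint rows, rather than P(A) and P(B) separately, keeps the correlation.
weights = {
    (True,  True):  Fraction(4, 10),
    (True,  False): Fraction(1, 10),
    (False, True):  Fraction(2, 10),
    (False, False): Fraction(3, 10),
}

def prob(event):
    # Probability of an event, given as a predicate on a truth-table row.
    return sum(w for row, w in weights.items() if event(row))

p_B  = prob(lambda r: r[1])           # P(B)  = 6/10
p_AB = prob(lambda r: r[0] and r[1])  # P(AB) = 4/10
print(p_AB / p_B)                     # P(A|B) = P(AB)/P(B) = 2/3
```

Note that P(A) on its own is 1/2, so conditioning on B really does move it; that shift is exactly the correlation information you can’t recover from the two marginal probabilities alone.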
Yeah, updating probability distributions over models is believed to be good. The problem is, sometimes our probability distributions over models are wrong, as demonstrated by bad behavior when we update on certain info.
The kind of data that would make you want to zero out non-90% models is when you observe a bunch of random data points and 90% of them are true, but there are no other patterns you can detect.
The other problem is that updates can be hard to compute.
I’m confused about what you want when you say this. I think an example or two would help me.
My best guess: you want the assigned probability to equal the “ideal” probability (as laid out in the probabilistic reflection paper) when we take the limit of computing power going to infinity. It’s difficult to take this limit for Abram’s original proposal, but if you make it doable by restricting it to S, now there can be things that are true but not provable using statements in S, so infinite computing power doesn’t get us there.
It’s actually not too hard to demonstrate things about the limit for Abram’s original proposal, unless there’s another one that’s original-er than the one I’m thinking of. It limits to the distribution of outcomes of a certain incomputable random process which uses a halting oracle to tell when certain statements are contradictory.
You are correct that it doesn’t converge to a limit of assigning 1 to true statements and 0 to false statements. This is of course impossible, so we don’t have to accept it. But it seems like we should not have to accept divergence—believing something with high probability, then disbelieving with high probability, then believing again, etc. Or perhaps we should?