I confess I’m ignorant about what report you mean—could you throw me a link?
Hm, x being simple or complicated...
For example, if our predicate P(x) is true just for prime numbers, then “the trillionth prime number” is a simple description—no prime number can have a greater complexity than its description qua prime number, and so if we assign high probability to simple descriptions like primeness, we’ll expect P to be true for simple numbers.
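(A toy sketch of my own, using the thousandth prime rather than the trillionth so it actually runs in a moment; the point is the same: the short description pins down the whole number, so the number's complexity is bounded by the complexity of its index plus a constant.)

    # Toy sketch (my own): the n-th prime is pinned down by the short description
    # "the n-th prime", so its complexity can't exceed the complexity of n plus
    # the fixed cost of this little program.
    def is_prime(m):
        if m < 2:
            return False
        d = 2
        while d * d <= m:
            if m % d == 0:
                return False
            d += 1
        return True

    def nth_prime(n):
        count, candidate = 0, 1
        while count < n:
            candidate += 1
            if is_prime(candidate):
                count += 1
        return candidate

    print(nth_prime(1000))   # 7919: a biggish number we got from a tiny description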
I would argue that this is proper behavior. It is not that our robot cares about the complexity of the number. It just tries to find simple patterns in the data, and naturally some numbers will fit into more simple patterns than others, once we’re given information about the density of P(x) being true.
An extreme example would be if P(x) were only true (or, equivalently, only false) for a single number. Then our probability for P(x) really would be about the complexity of that number. Why is this proper behavior? Because information that justifies pattern-matching behavior is information about the complexity of P(x)! If P(x) is true for just one number and we know that that number's description is less than 100 bits, that knowledge is what allows us to exhibit a preference for simpler patterns, and thus simpler numbers.
Fun fact: if we know that P(x) is true for 50% of numbers, the complexity of x has essentially no impact, even though we still favor simple patterns.
Hi Manfred, the link is in the theme 1 section of the original post. Let me know if it clarifies our worries about simplicity priors.
It looks like what I was talking about and what you were talking about were different things. I was talking about having a complexity-based prior over the behavior of our predicate P(x) on the basis of actually knowing about the complexity of P.
The bad behavior you wrote up, though, seems like a consequence of using a shoddy method for updating on outside information.
If we just want to throw computing power at the problem until it goes away, the solution is actually really simple: you pick a model at random like normal, but if it’s not consistent with the information we’re conditioning on, you throw it out. Note that this requires even probabilistic-seeming information to be phrased as frequencies, like “P(x) is true for 90% of some domain.” But since this whole method of assigning probabilities is built off of frequencies, you’re probably fine with that.
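Here's a minimal sketch of what I mean, with toy numbers of my own choosing (a ten-element domain, truth values picked by fair coins, and the conditioned-on information "exactly 9 of the 10 are true"); set the target to 5 instead and every x comes out near one half.

    import random

    DOMAIN = range(10)   # toy stand-in for "some domain"
    TARGET_TRUE = 9      # the conditioned-on frequency: 90% of the 10 elements

    def sample_conditioned_model():
        # Pick a model at random like normal; if it's not consistent with the
        # information we're conditioning on, throw it out and try again.
        while True:
            model = [random.random() < 0.5 for _ in DOMAIN]
            if sum(model) == TARGET_TRUE:
                return model

    samples = [sample_conditioned_model() for _ in range(10000)]
    # Estimated probability that P(0) is true under the conditioned process:
    print(sum(m[0] for m in samples) / len(samples))   # comes out near 0.9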
Can I ask a non-sequitur question? If I were to write up some posts about logical probability, largely reprising the analogous trains of thought in the first few chapters of Jaynes, what would be a good comprehensive strategy to get the people attending workshops like this to read them?
“you pick a model at random like normal, but if it’s not consistent with the information we’re conditioning on, you throw it out.”
You can do this for a finite set S by picking a random assignment of truth values to sentences in S, but (1) there is a nasty dependence on S, so the resulting distribution is very sensitive to the choice of S and definitely doesn’t converge to any limit (this is related to the difficulty of defining a “random model” of some axioms), (2) if S is a finite set, you actually need to have some sentence like “P(x) is true for 90% of x such that ‘P(x)’ is in S” which is really awkward, and I don’t know how to make it actually work.
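To make (1) concrete, here's a toy example of my own (two atoms, three sentences): the probability the process assigns to "A or B" depends on which other sentences happen to be in S.

    from itertools import product

    # Two atoms A, B; each sentence is a function of their truth values.
    sentences = {
        "A":      lambda a, b: a,
        "B":      lambda a, b: b,
        "A or B": lambda a, b: a or b,
    }

    def prob_true(target, S):
        # Uniform choice among the truth-value assignments to the sentences in S
        # that are realized by some actual valuation of the atoms A and B.
        consistent = {tuple(sentences[s](a, b) for s in S)
                      for a, b in product([False, True], repeat=2)}
        idx = S.index(target)
        return sum(row[idx] for row in consistent) / len(consistent)

    print(prob_true("A or B", ["A or B"]))             # 0.5
    print(prob_true("A or B", ["A", "B", "A or B"]))   # 0.75: same sentence, different S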
A technically correct solution along these lines would still be of interest, but my intuition is that this isn’t working for the right reasons and so there are likely to be further problems even if (2) is resolved and (1) is deemed non-problematic.
Also note that “pick a random model consistent with information we are conditioning on” is not the kind of solution we are looking for; we want conditioning to be conditioning. So your proposal would correspond to picking sentences at random for a while as in Abram Demski’s proposal, and then transitioning into choosing a model at random at some (probably randomly chosen) point. This is something we considered explicitly.
Anyhow, the question is, why is throwing out non-90% models and trying again going to make [the probability that we assign true to P(x) using this random process] behave like we want the probability that P(x) is true to behave?
We can answer this with an analogy to updating on new information. If we have a probability distribution over models, and we learn that the correct model says that P(x) is true for 90% of some domain, what we do is zero out the probability of all models where that's false, and normalize the remaining probabilities to get our new distribution. All this “output of the random process” stuff is really just describing a process that has some probability of outputting different models (that is, Abram’s process outputs a model drawn from some distribution, and then we take the probability that P(x) is true to be the probability that the output of Abram’s process assigns true to P(x)).
So the way you do updating is you zero out the probability that this process outputs a model where the conditioned-upon information is false, and then you normalize the outputs so that the process outputs one of the remaining models with the same relative frequencies. This is the same behavior as updating a probability distribution.
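A toy version of that, with assumptions of my own (a ten-element domain and a uniform prior over truth assignments standing in for whatever distribution the process actually outputs):

    from itertools import product

    DOMAIN = range(10)

    # Prior over models (truth assignments); uniform here, but any distribution
    # over models, e.g. the output distribution of the random process, works the same way.
    prior = {m: 1.0 for m in product([False, True], repeat=len(DOMAIN))}
    total = sum(prior.values())
    prior = {m: w / total for m, w in prior.items()}

    # Zero out models where the conditioned-on information ("90% of the domain
    # is true") is false, then renormalize the rest.
    posterior = {m: w for m, w in prior.items() if sum(m) == 9}
    z = sum(posterior.values())
    posterior = {m: w / z for m, w in posterior.items()}

    # Probability that P(0) is true after the update:
    print(sum(w for m, w in posterior.items() if m[0]))   # 0.9, same as above

It gives the same 0.9 as the throw-out-and-retry version above, which is exactly the equivalence I'm pointing at.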
--
One thing I think you might mean by “we want conditioning to be conditioning” is that you don’t want to store a (literal or effective) distribution over models, and then condition by updating that distribution and recalculating the probability of a statement. You want to store the probability of statements, and condition by doing something to that probability. Like, P(A|B) = P(AB)/P(B).
I like the aesthetics of that too—my first suggestion for logical probability was based off of storing probabilities of statements, after all. But to make things behave at all correctly, you need more than just that, you also need to be able to talk about correlations between probabilities. The easiest way to represent that? Truth tables.
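For example (two statements, made-up row weights): store the joint truth table rather than just the marginals, and conditioning is literally P(A|B) = P(AB)/P(B).

    # Joint "truth table" over two statements A and B; the row weights are made up.
    joint = {
        (True, True):   0.4,
        (True, False):  0.1,
        (False, True):  0.2,
        (False, False): 0.3,
    }

    p_b  = sum(p for (a, b), p in joint.items() if b)          # P(B)  = 0.6
    p_ab = sum(p for (a, b), p in joint.items() if a and b)    # P(AB) = 0.4
    print(p_ab / p_b)   # P(A|B) = 0.666..., not recoverable from P(A) and P(B) alone

P(A) = 0.5 and P(B) = 0.6 on their own don't determine that answer; you need the joint rows, which is the correlation information I mean.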
Yeah, updating probability distributions over models is believed to be good. The problem is that sometimes our probability distributions over models are wrong, as demonstrated by bad behavior when we update on certain information.
The kind of data that would make you want to zero out non-90% models is when you observe a bunch of random data points and 90% of them are true, but there are no other patterns you can detect.
The other problem is that updates can be hard to compute.
I’m confused about what you want when you say this. I think an example or two would help me.
My best guess: you want the assigned probability to equal the “ideal” probability (as laid out in the probabilistic reflection paper) when we take the limit of computing power going to infinity. It’s difficult to take this limit for Abram’s original proposal, but if you make it doable by restricting it to S, now there can be things that are true but not provable using statements in S, so infinite computing power doesn’t get us there.
It’s actually not too hard to demonstrate things about the limit for Abram’s original proposal, unless there’s another one that’s original-er than the one I’m thinking of. It limits to the distribution of outcomes of a certain incomputable random process which uses a halting oracle to tell when certain statements are contradictory.
You are correct that it doesn’t converge to a limit of assigning 1 to true statements and 0 to false statements. This is of course impossible, so we don’t have to accept it. But it seems like we should not have to accept divergence—believing something with high probability, then disbelieving with high probability, then believing again, etc. Or perhaps we should?