I agree with most of that, but why favor less information content? Though I may not fully understand the math, this recent post by cousin_it seems to be saying that priors should not always depend on Kolmogorov complexity.
And, even if we do decide to favor less information content, how much emphasis should we place on it?
In general, I would think that the more information is in a theory, the more specific it is, and the more specific it is, the smaller the proportion of possible worlds that comply with it.
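To make “proportion of possible worlds” a little more concrete, here is a toy counting argument (assuming, purely for illustration, that a world is described by n independent binary facts and that each piece of specificity in a theory pins down one of them): a theory that fixes k of the n facts is consistent with

$$\frac{2^{\,n-k}}{2^{\,n}} = 2^{-k}$$

of the possible worlds, so every additional bit of specificity halves the proportion of worlds that comply.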
Regarding how much emphasis we should place on it: I would say “a lot”, but there are complications. Theories aren’t used in isolation, but tend to provide a kind of informally assembled world view, and then there is the issue of degree of matching.
Which theory has more information?
All crows are black
All crows are black except [270 pages of exceptions]
I didn’t say you ignored previous correspondence with reality, though.
That isn’t Perplexed’s point. Let’s say that as of this moment all crows that have been observed are black, so both of his hypotheses fit the data. Why should “all crows are black” be assigned a higher prior than “All crows are black except [270 pages of exceptions]”? Based on cousin_it’s post, I don’t see any reason to do that.
So, to revive this discussion: if we must distribute probability mass evenly because we cannot place emphasis on simplicity, shouldn’t our priors be almost zero for every hypothesis? It seems to me that the “underdetermination” problem makes it very hard to use priors in a meaningful way.
I am assuming here that all the crows that we have previously seen have been black, and therefore that both theories have the same agreement, or at least approximate agreement, with what we know.
The second theory clearly has more information content.
Why would it not make sense to use the first theory on this basis?
The fact that all the crows we have seen so far are black makes it a good idea to assume black crows in the future. There may be instances of non-black crows where the theory predicted black crows, but that simply means that the theory is not 100% accurate.
If the 270 pages of exceptions have not come from anywhere, the fact that they are unjustified just makes them random, unjustified specificity. Out of all the possible worlds we can imagine that are consistent with what we know, the proportion that agree with that specificity is going to be small. If most crows are black, as I am assuming our experience suggests, then when this second theory predicts a non-black crow, as one of its exceptions, it will probably be wrong: the unjustified specificity is contributing to failures of the theory. On the other hand, when the occasional non-black crow does show up, there is no reason to think the second theory will be much better than the first at predicting it. So the second theory would seem to have all of the first theory’s errors of wrongly predicting black crows, plus extra errors of wrongly predicting non-black crows, introduced by the unjustified specificity.
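Here is a crude simulation of that argument, with made-up rates (P_NONBLACK and P_EXCEPTION are assumptions, not data): a predictor that always says “black” is wrong only on the genuinely non-black crows, while a predictor with arbitrary exceptions is wrong on those same crows and, in addition, on the black crows it wrongly excepted.

```python
import random

random.seed(0)
N = 100_000          # number of future crows to predict
P_NONBLACK = 0.001   # assumed true rate of non-black crows
P_EXCEPTION = 0.01   # assumed fraction of crows the 270 pages single out as exceptions

# "True" colours: almost all crows are black.
is_black = [random.random() > P_NONBLACK for _ in range(N)]

# Theory 1: predict that every crow is black.
predict1 = [True] * N

# Theory 2: predict black, except at arbitrary, unjustified exception points.
predict2 = [random.random() > P_EXCEPTION for _ in range(N)]

errors1 = sum(a != p for a, p in zip(is_black, predict1))
errors2 = sum(a != p for a, p in zip(is_black, predict2))
print("theory 1 errors:", errors1)   # roughly N * P_NONBLACK
print("theory 2 errors:", errors2)   # roughly N * (P_NONBLACK + P_EXCEPTION)
```

With these made-up numbers the exception-laden theory makes roughly ten times as many wrong predictions.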
Now, if you want to say that we don’t have experience of mainly black crows, or that the 270 pages of exceptions come from somewhere, then that puts us into a different scenario: a more complicated one.
Looking at it in a simple way, however, I think this example actually just demonstrates that information in a theory should be minimized.
I haven’t been following the discussion on this topic very closely, so my response may be about stuff you already know or already know is wrong. But, since I’m feeling reckless today, I will try to say something interesting.
There are two different information metrics we can use regarding theories. The first deals with how informative a theory is about the world. The ideally informative theory tells us a lot about the world. Or, to say the same thing in different language, an informative theory rules out as many “possible worlds” as it can; it tells us that our own world is very special among all otherwise possible worlds; that the set of worlds consistent with the theory is a small set. We may as well call this kind of information Shannon information, or S-information. A Karl Popper fan would approve of making a theory as S-informative as possible, because then it is exposing itself to the greatest risk of refutation.
The second information metric measures how much information is required to communicate the theory to someone. My 270 pages of fine print in the second crow theory might be an example of a theory with a lot of this kind of information. Let us call this kind of information Kolmogorov information, or K-information. My understanding of Occam’s razor is that it recommends that our theories should use as little K-information as possible.
So we have Occam telling us to minimize the K-information and Popper telling us to maximize the S-information. Luckily, the two types of information are not closely related, so (assuming that the universe does not conspire against us) we can frequently do reasonably well by both criteria. So much for the obvious and easy points.
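A toy illustration of the two metrics coming apart, where the “world” is nothing more than the colours of 12 crows and compressed text length is only a very crude stand-in for K-information: the two theories below rule out exactly the same number of possible worlds (same S-information), yet the exception-laden one costs more to state.

```python
import math
import zlib
from itertools import product

N_CROWS = 12  # a "world" is just a colour assignment to 12 crows, black or white
worlds = list(product("BW", repeat=N_CROWS))  # 2**12 = 4096 possible worlds

def s_information(consistent):
    """Bits of S-information: how sharply the theory narrows down the set of worlds."""
    frac = sum(consistent(w) for w in worlds) / len(worlds)
    return -math.log2(frac)

def k_proxy(statement):
    """Crude K-information proxy: compressed length of the theory's statement."""
    return len(zlib.compress(statement.encode()))

h1 = ("All crows are black.",
      lambda w: all(c == "B" for c in w))
h2 = ("All crows are black except crow 7, which is white.",
      lambda w: w[7] == "W" and all(c == "B" for i, c in enumerate(w) if i != 7))

for text, pred in (h1, h2):
    print(f"{text!r}: S = {s_information(pred):.0f} bits, K-proxy = {k_proxy(text)} bytes")
```

Both come out at 12 bits of S-information, but the second needs noticeably more bytes to write down.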
The trouble appears, especially for biologists and other “squishy” scientists, when Nature seems to have set things up so that every law has some exceptions. I’ll leave it to you to Google on either “white crow” or “white raven” and to admire those fine and intelligent birds. So, given our objectives of maximizing one information measure and minimizing the other, how should we proceed? Do we change our law to say “99+% of crows are black?” Do we change it to say “All crows are black, not counting ravens as crows, and except for a fraction under 1% of crows which are albinos and also have pink eyes?” I don’t know, but maybe you have thought about it more than I have.
We change it to say, “99+% of crows have such-and-such alleles of genes for determining feather colour; certain other alleles are rare and result in a bird lacking feather pigments due to the synthesis pathway being broken at such-and-such a step for lack of such-and-such a protein. The mutation is disadvantageous, hence the absence of any substantial population of white crows.” (Or whatever the actual story is, I’m just making that one up.) If we don’t know the actual story, then the best we can do is say that for reasons we don’t know, it happens now and then that black crows can give birth to a white offspring.
Squishiness is not a property of biological phenomena, but of our knowledge of those phenomena. Exceptions are in our descriptions, not in Nature.
I wonder if it helps to arrange K-information in layers. You could start with “Almost all crows are black”, and then add footnotes for how rare white crows actually are, what causes them, how complete we think our information about crow color distribution is and why, and possibly some factors I haven’t thought of.
Layering or modularizing the hypothesis: Of course, you can do this, and you typically do do this. But, layering doesn’t typically change the total quantity of K-information. A complex hypothesis still has a lot of K-information whether you present it as neatly layered or just jumbled together. Which brings us to the issue of just why we bother calculating the K-information content of a hypothesis in the first place.
There is a notion, mentioned in Jaynes and also in another thread active right now, that the K-information content of a hypothesis determines the prior probability that ought to be attached to it (in the absence of, or prior to, empirical evidence). So, it seems to me that the interesting thing about your layering suggestion is how the layering should tie in to the Bayesian inference machinery which we use to evaluate theories.
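For concreteness, the usual way that notion gets formalized (a Solomonoff-style universal prior; I’m taking that as the standard reading rather than claiming it is exactly Jaynes’s construction) is

$$P(h) \propto 2^{-K(h)},$$

so each extra bit needed to state a hypothesis halves its no-data prior.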
For example, suppose we have a hypothesis which, based on evidence so far, has a subjective “probability of correctness” of, say, 0.5. Then we get a new bit of evidence. We observe a white (albino) crow, for example. Doing standard Bayesian updating, the probability of our hypothesis drops to 0.001, say. So we decide to try to resurrect our hypothesis by adding another layer. Trouble is, we have just increased the K-complexity of the hypothesis, and that ought to hurt us in our original “no-data” prior. Trouble is, we already have data. Lots of it. So is there some algebraic trick which lets us add that new layer to the hypothesis without going back to evidential square one?
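Filling in that arithmetic with made-up numbers, since a strict “all crows are black” would actually drop to exactly zero on seeing a white crow: suppose a white-crow sighting has a 1-in-100,000 chance of happening even if the hypothesis is true (misidentification, say) and a 1-in-100 chance under the alternatives. Then

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E \mid H)\,P(H) + P(E \mid \lnot H)\,P(\lnot H)} = \frac{10^{-5} \cdot 0.5}{10^{-5} \cdot 0.5 + 10^{-2} \cdot 0.5} \approx 0.001.$$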
K-information is about communicating to “someone”—do you compute the amount of K-information for the most receptive person you’re communicating with, or do you have a different amount for each layer of detail?
Actually, you might have a tree structure, not just layers—the prevalence of white crows in time and space is a different branch than the explanation of how crows can be white.
A very interesting question. Especially when you consider the analogy with Kolmogorov complexity. Here we have an ambiguity as to what person we communicate to. There, the ambiguity was regarding exactly which model of universal Turing machine we were programming. And there, there was a theorem to the effect that the differences among Turing machines aren’t all that big. Do we have a similar theorem here, for the differences among people—seen as universal programmable epistemic engines?
Bayesian updating is timeless. It doesn’t care whether you observed the data before or after you wrote the hypothesis.
So, it sounds like you are suggesting that we can back out all that data, change our hypothesis and prior, and then read the data back in. In theory, yes. But sometimes we don’t even remember the data that brought us to where we are now. Hence the desirability of a trick. Is there an updating-with-new-hypothesis rule to match Bayes’s updating-with-new-evidence rule?
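A minimal sketch of the “timeless” point, with made-up hypotheses and data (the two candidate white-crow rates are arbitrary): updating on all the observations in one batch and updating on them one at a time land on the same posterior, because the data only ever enter through the product of their likelihoods. That at least suggests that a hypothesis added late needs its likelihood on the already-seen data (or a sufficient statistic for it), not a replay of the observations in their original order.

```python
import math

# Toy model: each observation is a crow colour, "B" (black) or "W" (white).
# A hypothesis is just a value for "probability that a random crow is white".
hypotheses = {"white crows are vanishingly rare": 1e-6, "about 1% of crows are white": 1e-2}
prior = {h: 0.5 for h in hypotheses}

def likelihood(p_white, obs):
    return p_white if obs == "W" else 1.0 - p_white

def update(posterior, data):
    """Multiply in the likelihood of `data` under each hypothesis, then renormalize."""
    post = {h: posterior[h] * math.prod(likelihood(p, obs) for obs in data)
            for h, p in hypotheses.items()}
    total = sum(post.values())
    return {h: v / total for h, v in post.items()}

data = ["B"] * 500 + ["W"]           # 500 black crows, then one white crow

one_shot = update(prior, data)       # all the data at once
sequential = prior
for obs in data:                     # the same data, one observation at a time
    sequential = update(sequential, [obs])

print(one_shot)
print(sequential)  # agrees with one_shot up to floating-point noise
```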