I haven’t been following the discussion on this topic very closely, so my response may be about stuff you already know or already know is wrong. But, since I’m feeling reckless today, I will try to say something interesting.
There are two different information metrics we can use regarding theories. The first deals with how informative a theory is about the world. The ideally informative theory tells us a lot about the world. Or, to say the same thing in different language, an informative theory rules out as many “possible worlds” as it can; it tells us that our own world is very special among all otherwise possible worlds, i.e. that the set of worlds consistent with the theory is a small set. We may as well call this kind of information Shannon information, or S-information. A Karl Popper fan would approve of making a theory as S-informative as possible, because then it is exposing itself to the greatest risk of refutation.
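To make that concrete, here is a toy sketch (Python, with a made-up universe of just ten crows) of S-information as the number of bits by which a theory narrows down the set of possible worlds:

```python
from itertools import product
from math import log2

# Toy universe of "possible worlds": every assignment of colours to 10 crows.
CROWS = 10
WORLDS = list(product(["black", "white"], repeat=CROWS))

def s_information(theory) -> float:
    """Bits of S-information: how sharply the theory narrows the set of worlds."""
    consistent = sum(1 for w in WORLDS if theory(w))
    return -log2(consistent / len(WORLDS))

all_black = lambda w: all(c == "black" for c in w)           # "All crows are black"
mostly_black = lambda w: sum(c == "black" for c in w) >= 9   # "At least 9 of 10 are black"

print(s_information(all_black))     # 10.0 bits: rules out every world but one
print(s_information(mostly_black))  # ~6.5 bits: a weaker claim, rules out fewer worlds
```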
The second information metric measures how much information is required to communicate the theory to someone. My 270 pages of fine print in the second crow theory might be an example of a theory with a lot of this kind of information. Let us call this kind of information Kolmogorov information, or K-information. My understanding of Occam’s razor is that it recommends that our theories should use as little K-information as possible.
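If you want something you can actually compute, compressed length is a crude, imperfect stand-in for K-information (true Kolmogorov complexity is uncomputable). A sketch along those lines, with invented example theories:

```python
import zlib

def k_proxy(theory_text: str) -> int:
    """Bytes needed to communicate the theory, approximated by compressed length."""
    return len(zlib.compress(theory_text.encode("utf-8")))

short_theory = "All crows are black."
long_theory = "All crows are black, except: " + "; ".join(
    f"special clause {i} covering some exceptional case" for i in range(200)
)

print(k_proxy(short_theory))  # a few dozen bytes
print(k_proxy(long_theory))   # far more, even though the repetition compresses well
```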
So we have Occam telling us to minimize the K-information and Popper telling us to maximize the S-information. Luckily, the two types of information are not closely related, so (assuming that the universe does not conspire against us) we can frequently do reasonably well by both criteria. So much for the obvious and easy points.
The trouble appears, especially for biologists and other “squishy” scientists, when Nature seems to have set things up so that every law has some exceptions. I’ll leave it to you to Google on either “white crow” or “white raven” and to admire those fine and intelligent birds. So, given our objectives of maximizing one information measure and minimizing the other, how should we proceed? Do we change our law to say “99+% of crows are black?” Do we change it to say “All crows are black, not counting ravens as crows, and except for a fraction under 1% of crows which are albinos and also have pink eyes?” I don’t know, but maybe you have thought about it more than I have.
We change it to say, “99+% of crows have such-and-such alleles of genes for determining feather colour; certain other alleles are rare and result in a bird lacking feather pigments because the synthesis pathway is broken at such-and-such a step for lack of such-and-such a protein. The mutation is disadvantageous, hence the absence of any substantial population of white crows.” (Or whatever the actual story is; I’m just making that one up.) If we don’t know the actual story, then the best we can do is say that, for reasons we don’t know, black crows now and then produce white offspring.
Squishiness is not a property of biological phenomena, but of our knowledge of those phenomena. Exceptions are in our descriptions, not in Nature.
I wonder if it helps to arrange K-information in layers. You could start with “Almost all crows are black”, and then add footnotes for how rare white crows actually are, what causes them, how complete we think our information about crow color distribution is and why, and possibly some factors I haven’t thought of.
Layering or modularizing the hypothesis: Of course, you can do this, and you typically do do this. But, layering doesn’t typically change the total quantity of K-information. A complex hypothesis still has a lot of K-information whether you present it as neatly layered or just jumbled together. Which brings us to the issue of just why we bother calculating the K-information content of a hypothesis in the first place.
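To illustrate that point with the same compressed-length stand-in for K-information (the layers, the rate, and the mechanism below are invented):

```python
import zlib

layers = [
    "Layer 0: almost all crows are black.",
    "Layer 1: roughly 1 in 10,000 crows is white (rate made up).",
    "Layer 2: white crows carry a broken pigment-pathway allele (mechanism made up).",
]

layered = "\n\n".join(layers)  # neatly presented, layer by layer
jumbled = " ".join(layers)     # the same content, run together

# Roughly the same compressed size either way: layering changes the
# presentation, not the total amount of K-information.
print(len(zlib.compress(layered.encode())), len(zlib.compress(jumbled.encode())))
```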
There is a notion, mentioned in Jaynes and also in another thread active right now, that the K-information content of a hypothesis is directly related to the prior probability that ought to be attached to it, in the absence of (or prior to) empirical evidence. So, it seems to me that the interesting thing about your layering suggestion is how the layering should tie in to the Bayesian inference machinery which we use to evaluate theories.
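The usual way that notion gets cashed out is a complexity-penalised prior, something like P(H) proportional to 2^(-K(H)). A toy sketch, with made-up description lengths standing in for K:

```python
from fractions import Fraction

# Complexity-penalised prior: P(H) proportional to 2 ** (-K(H)).
# The description lengths (in bits) below are made up for illustration.
description_bits = {
    "all crows are black": 20,
    "all crows are black except rare albinos": 35,
    "270 pages of fine print": 2000,
}

weights = {h: Fraction(1, 2 ** k) for h, k in description_bits.items()}
total = sum(weights.values())

for h, w in weights.items():
    print(f"{h}: {float(w / total):.3g}")
# The 270-page theory starts out with an effectively zero prior,
# before any evidence has been seen.
```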
For example, suppose we have a hypothesis which, based on evidence so far, has a subjective “probability of correctness” of, say, 0.5. Then we get a new bit of evidence. We observe a white (albino) crow, for example. Doing standard Bayesian updating, the probability of our hypothesis drops to, say, 0.001. So we decide to try to resurrect our hypothesis by adding another layer. Trouble is that we have just increased the K-complexity of the hypothesis, and that ought to hurt us in our original “no-data” prior. Trouble is, we already have data. Lots of it. So is there some algebraic trick which lets us add that new layer to the hypothesis without going back to evidential square one?
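To pin down the arithmetic in that example, here is the update written out, with likelihoods invented so that the numbers above come out:

```python
def bayes_update(prior, p_e_given_h, p_e_given_not_h):
    """Posterior P(H | E) by Bayes' rule, with all alternatives lumped into not-H."""
    joint_h = prior * p_e_given_h
    joint_not_h = (1 - prior) * p_e_given_not_h
    return joint_h / (joint_h + joint_not_h)

# H = "essentially all crows are black", E = "we just saw a white crow".
# The likelihoods are made up, chosen so the numbers in the text come out.
posterior = bayes_update(prior=0.5, p_e_given_h=0.001, p_e_given_not_h=0.999)
print(posterior)  # ~0.001: one surprising observation can all but demolish the hypothesis
```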
K-information is about communicating to “someone”—do you compute the amount of K-information for the most receptive person you’re communicating with, or do you have a different amount for each layer of detail?
Actually, you might have a tree structure, not just layers—the prevalence of white crows in time and space is a different branch than the explanation of how crows can be white.
K-information is about communicating to “someone”—do you compute the amount of K-information for the most receptive person you’re communicating with, or do you have a different amount for each layer of detail?
A very interesting question, especially when you consider the analogy with Kolmogorov complexity. Here we have an ambiguity as to what person we communicate to. There, the ambiguity was regarding exactly what model of universal Turing machine we were programming. And there, there was a theorem to the effect that the differences among Turing machines aren’t all that big. Do we have a similar theorem here, for the differences among people, seen as universal programmable epistemic engines?
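For reference, the theorem I have in mind is the invariance theorem, which says that the choice of universal machine shifts description lengths by at most an additive constant:

```latex
% Invariance theorem (sketch): for any two universal Turing machines U and V,
% there is a constant c_{U,V}, independent of the string x, such that
\[
  \lvert K_U(x) - K_V(x) \rvert \;\le\; c_{U,V} \qquad \text{for all strings } x,
\]
% so switching machines changes a theory's description length by at most a
% machine-dependent constant.
```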
Trouble is, we already have data. Lots of it. So is there some algebraic trick which lets us add that new layer to the hypothesis without going back to evidential square one?
Bayesian updating is timeless. It doesn’t care whether you observed the data before or after you wrote the hypothesis.
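Concretely: written in odds form, each observation just multiplies the odds by its likelihood ratio, and multiplication doesn't care about order. A small sketch with made-up likelihood ratios:

```python
from functools import reduce

def update(prior, likelihood_ratio):
    """One Bayesian update in odds form: posterior odds = prior odds * likelihood ratio."""
    odds = prior / (1 - prior) * likelihood_ratio
    return odds / (1 + odds)

# Made-up likelihood ratios P(E_i | H) / P(E_i | not-H) for three observations.
evidence = [4.0, 0.25, 10.0]

forward = reduce(update, evidence, 0.5)
backward = reduce(update, list(reversed(evidence)), 0.5)
print(forward, backward)  # the same (up to rounding): the order of the data doesn't matter
```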
So, it sounds like you are suggesting that we can back out all that data, change our hypothesis and prior, and then read the data back in. In theory, yes. But sometimes we don’t even remember the data that brought us to where we are now. Hence the desirability of a trick. Is there an updating-with-new-hypothesis rule to match Bayes’s updating-with-new-evidence rule?