On the other hand, if we do extend entropy to arbitrary propositions A, it probably does make sense to define it as the conditional expectation S(A) = E[-log p | A], as you did.
Then “average entropy”/”entropy” of a macrostate p_A is S(True) under the distribution p_A, and “entropy”/”surprisal” of a microstate B (in the macrostate p_A) is S(B) under the distribution p_A.
By a slight coincidence, S(True) = S(A) under p_A, but S(True) is the thing that generalizes to give entropy of an arbitrary distribution.
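To spell the coincidence out (a minimal derivation; the expectation notation $E_{p_A}[\cdot]$ and the uniform-case remark are my own additions, not part of the original comment):

```latex
% p_A assigns probability 0 to every microstate outside A, so those terms vanish
% from any expectation taken under p_A, and p_A(A) = 1. Hence
\begin{align*}
  S(\mathrm{True}) \text{ under } p_A
    &= E_{p_A}[-\log p_A]
     = \sum_{x} p_A(x)\,(-\log p_A(x))
     = \sum_{x \in A} p_A(x)\,(-\log p_A(x)) \\
    &= E_{p_A}[-\log p_A \mid A]
     = S(A) \text{ under } p_A .
\end{align*}
% In the uniform case p_A(x) = 1/|A| on A, both sides reduce to log|A|. For a
% distribution that puts positive probability outside A the two come apart, and
% S(True) is the one that reduces to the entropy of the distribution.
```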
I think I was a little confused about your comment and leapt to one possible definition of S() which doesn’t satisfy all the desiderata you had. Also, I don’t like my definition anymore, anyway.
Disclaimer: This is probably not a good enough definition to be worth spending much time worrying about.
First things first:
We may perhaps think of fundamental “microstates” as (descriptions of) “possible worlds”, or complete, maximally specific possible ways the world may be. Since all possible worlds are mutually exclusive (just exactly one possible world is the actual world), every proposition can be seen as a disjunction of such possible worlds: the worlds in which the proposition is true.
I think this is indeed how we should think of “microstates”. (I don’t want to use the word “macrostate” at all, at this point.)
I was thinking of something like: given a probability distribution p and a proposition A, define
“S(A) under p” $= \frac{\sum_{x \in A} p(x)\,(-\log p(x))}{\sum_{x \in A} p(x)}$
where the sums are over all microstates x in A. Note that the denominator is equal to p(A).
I also wrote this as S(A) = expectation of (-log p(x)) conditional on A, or $S(A) = E[(-\log p) \mid A]$, but I don’t think it was clear in my previous comment that “log p” meant “log p(x) for a microstate x”.
I also defined a notation p_A to represent the probability distribution that assigns probability 1/|A| to each x in A and 0 to each x not in A.
I used T to mean a tautology (in this context: the full set of microstates).
Then I pointed out a couple of consequences (there is a small numerical check of these in the sketch after this list):
Typically, when people talk about the “entropy of a macrostate A”, they mean something equal to log|A|. Conceptually, this is based on the calculation $\sum_{x \in A} \frac{1}{|A|}\left(-\log \frac{1}{|A|}\right)$, which is the same as either “S(A) under p_A” (in my goofy notation) or “S(T) under p_A”, but I was claiming that you should think of it as the latter.
The (Shannon/Gibbs) entropy of p, for a distribution p, is equal to “S(T) under p” in this notation.
Finally, for a microstate x in any distribution p, we get that “S({x}) under p” is equal to -log p(x).
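For what it’s worth, here is a minimal Python sketch of this definition that checks the three consequences above numerically. The toy distribution, the base-2 logarithm, the zero-probability guard, and all the names are my own illustrative choices, not anything from the original comments.

```python
import math

# "S(A) under p" = sum_{x in A} p(x) * (-log p(x)) / sum_{x in A} p(x),
# i.e. the expectation of -log p(x) conditional on A (zero-probability states dropped).
def S(A, p):
    p_of_A = sum(p[x] for x in A)
    return sum(p[x] * -math.log2(p[x]) for x in A if p[x] > 0) / p_of_A

p = {"x1": 0.5, "x2": 0.25, "x3": 0.125, "x4": 0.125}   # a toy distribution
T = set(p)                                              # tautology: all microstates
A = {"x2", "x3"}                                        # some proposition

# "S(T) under p" equals the Shannon/Gibbs entropy of p.
print(S(T, p), -sum(q * math.log2(q) for q in p.values()))   # 1.75, 1.75

# "S({x}) under p" equals the surprisal -log p(x).
print(S({"x2"}, p), -math.log2(p["x2"]))                     # 2.0, 2.0

# Under the uniform distribution p_A on A, both "S(A) under p_A" and
# "S(T) under p_A" equal log|A|.
p_A = {x: (1 / len(A) if x in A else 0.0) for x in p}
print(S(A, p_A), S(T, p_A), math.log2(len(A)))               # 1.0, 1.0, 1.0
```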
All of this satisfied my goals of including the most prominent concepts in Alex’s post:
log |A| for a macrostate A
Shannon/Gibbs entropy of a distribution p
-log p(x) for a microstate x
And a couple other goals:
Generalizing the Shannon/Gibbs entropy, which is $S(p) = E_{x \sim p}[-\log p(x)]$, in a natural way to incorporate a proposition A (by making the expectation into a conditional expectation)
Not doing too much violence to the usual meaning of “entropy of macrostate A” or “the entropy of p” in the process
But it did so at the cost of:
making “the entropy of macrostate A” and “S(A) under p” two different things
contradicting standard terminology and notation anyway
reinforcing the dependence on microstates and the probabilities of microstates, contrary to what you wanted to do
So my suggestion would be to just ignore it and do your own thing.
Okay, I understand. The problem with fundamental microstates is that they only really make sense if they are possible worlds, and possible worlds bring their own problems.
One is: we can gesture at them, but we can’t grasp them. They are too big, they each describe a whole world. We can grasp the proposition that snow is white, but not the equivalent disjunction of all the possible worlds where snow is white. So we can’t use them for anything psychological like subjective Bayesianism. But maybe that’s not your goal anyway.
A more general problem is that there are infinitely many possible worlds. There are even infinitely many where snow is white. This means it is unclear how we should define a uniform probability distribution over them. Naively, if 1/∞ is 0, their probabilities do not sum to 1, and if it is larger than 0, they sum to infinity. Either option would violate the probability axioms.
Warning: long and possibly unhelpful tangent ahead
Wittgenstein’s solution for this and other problems (in the Tractatus) was to ignore possible worlds and instead regard “atomic propositions” as basic. Each proposition is assumed to be equivalent to a finite logical combination of such atomic propositions, where logical combination means propositional logic (i.e. with connectives like not, and, or, but without quantifiers). Then the a priori probability of a proposition is defined as the number of rows in its truth table where the proposition is true, divided by the total number of rows. For example, for atomic a and b, the proposition a∨b has probability 3/4, while a∧b has probability 1/4: the disjunction has three of the four possible truth-makers, namely (true, true), (true, false) and (false, true), while the conjunction has only one, namely (true, true).
This definition in terms of the ratio of true rows in the “atomicized” truth-table is equivalent to the assumption that all atomic propositions have probability 1⁄2 and that they are all probabilistically independent.
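To make the truth-table definition concrete, here is a rough Python sketch; encoding formulas as functions over truth-value assignments is just my own illustrative device.

```python
from itertools import product

# Tractatus-style a priori probability: the fraction of truth-table rows
# (assignments to the atomic propositions) on which the formula comes out true.
def apriori_probability(formula, atoms):
    rows = list(product([True, False], repeat=len(atoms)))
    true_rows = sum(1 for row in rows if formula(dict(zip(atoms, row))))
    return true_rows / len(rows)

atoms = ["a", "b"]
print(apriori_probability(lambda v: v["a"] or v["b"], atoms))    # 0.75  (a or b)
print(apriori_probability(lambda v: v["a"] and v["b"], atoms))   # 0.25  (a and b)
print(apriori_probability(lambda v: v["a"], atoms))              # 0.5   (a single atom)
```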
Wittgenstein did not do it, but we can then also define a measure of information content (or surprisal, or entropy, or whatever we want to call it) of propositions, in the following way:
Each atomic proposition has information content 1.
The information content of a conjunction of two atomic propositions is additive, i.e. the sum of their individual information contents.
The information content of a tautology is 0.
So for a conjunction of n atomic propositions, the information content of that conjunction is n (1 + 1 + 1 + ... = n), while its probability is $2^{-n}$ (1/2 × 1/2 × 1/2 × ... = $2^{-n}$). Generalizing this to arbitrary (i.e. possibly non-atomic) propositions A, the relation between probability p and information content i is
$$2^{-i(A)} = p(A)$$
or, equivalently,
$$i(A) = -\log_2 p(A).$$
Now that formula sure looks familiar!
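As a quick worked check of the relation (my own numbers, reusing the a∨b example from above):

```latex
% Over two atoms, a \lor b is true on three of the four truth-table rows, so
\[
  p(a \lor b) = \tfrac{3}{4}, \qquad
  i(a \lor b) = -\log_2 \tfrac{3}{4} = 2 - \log_2 3 \approx 0.415,
\]
% while a conjunction of n atoms has p = 2^{-n} and hence i = -\log_2 2^{-n} = n,
% which matches the additivity postulate above.
```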
The advantage of Wittgenstein’s approach is that we can assign an a priori probability distribution to propositions without having to assume a uniform probability distribution over possible worlds. It is assumed that each proposition is only a finite logical combination of atomic propositions, which would avoid problems with infinity. The same thing holds for information content (or “entropy” if you will).
Problem is … it is unclear what atomic propositions are. Wittgenstein did believe in them, and so did Bertrand Russell, but Wittgenstein eventually gave up the idea. To be clear, propositions expressed by sentences like “Snow is white” are not atomic in Wittgenstein’s sense. “Snow is white” is not probabilistically independent of “Snow is green”, and it doesn’t necessarily seem to have a priori probability 1⁄2. Moreover, the restriction to propositional logic is problematic. If we assume quantifiers, Wittgenstein suggested that we interpret the universal quantifier “all” as a possibly infinite conjunction of atomic propositions, and the existential quantifier “some” as a possibly infinite disjunction of atomic propositions. But that leads again to problems with infinity. It would always give the former probability 0 and the latter probability 1.
So logical atomism may be just as much of a dead end as possible worlds, perhaps worse. But it is somewhat interesting to note that approaches like algorithmic complexity have similar issues. We may want to assign a string of bits a probability or a complexity (an entropy? an information content?), but we may also want to say that some such string corresponds to a proposition, e.g. a hypothesis we are interested in. There is a superficial way of associating binary strings with propositional formulas, by interpreting e.g. 1001 as the conjunction a∧¬b∧¬c∧d. But there likewise seems to be no room for quantifiers in this interpretation.
I guess a question is what you want to do with your entropy theory. Personally I would like to find some formalization of Ockham’s razor which is applicable to Bayesianism. Here the problems mentioned above appear fatal. Maybe for your purposes the issues aren’t as bad though?
Could you clarify this part?
I think I don’t understand your notation here.