Cool cool. I can personally see the appeal of reserving ‘entropy’ for the case where the distribution and the (natural lifting of) the code (to a distribution) are identical, i.e. your proposal without the wiggle-room.
I don’t yet personally see a boundary between the wiggle-room you’re considering and full-on “we can say ‘entropy’ as a shorthand for ‘cross-entropy’ when the second distribution is clear from context” proposal.
In particular, I currently suspect that there’s enough wiggle-room in “optimal code of its type for compression under the source distribution” to drive a truck through. Like, if we start out with a uniform code C and a state s, why not say that the “type of codes” for the source distribution δ(s) is the powerset of {c ∈ C | c codes for s}? In which case the “optimal code for compression” is the set of all such c, and the ‘entropy’ is the Nate!complexity?
I’m not yet sure whether our different aesthetics here are due to:
me failing to see a natural boundary that you’re pointing to
you not having yet seen how slippery the slope is
you having a higher tolerance for saying “humans sometimes just wanna put fences halfway down the slippery slope, dude”.
Insofar as you think I’m making the mistake of (1), I’m interested to hear arguments. My argument above is ofc tuned to case (2), and it’s plausible to me that it pushes you off the fence towards “no wiggle room”.
Another place we might aesthetically differ is that I’m much happier blurring the line between entropy and cross-entropy.
One handwavy argument for blurring the line (which has the epistemic status: regurgitating from a related cache that doesn’t cleanly apply) is that if the statespace is uncountably infinite then we need a measure in order to talk about entropy (and make everything work out nicely under change-of-variables). And so in the general case, entropy is already a two-place function involving a distribution and some sort of measure. (...Although my cache is unclear on whether this generalization yields the KL divergence or the cross-entropy. Probably KL. But, like, if we’re holding the correct distribution fixed, then they differ only by a constant. Hopefully. This is as far as I can go w/out warming up that cache, and I don’t wanna warm it up right now \shrug.)

My argument above is ofc tuned to case (2), and it’s plausible to me that it pushes you off the fence towards “no wiggle room”.

Yup, I think I am happy to abandon the wiggle room at this point, for this reason.
if the statespace is uncountably infinite then we need a measure in order to talk about entropy (and make everything work out nicely under change-of-variables). And so in the general case, entropy is already a two-place function involving a distribution and some sort of measure.

I think my preferred approach to this is that the density p(x) is not really the fundamental object, and should be thought of as dP/dmu(x), with the measure in the denominator. We multiply by dmu(x) in the integral for entropy in order to remove this dependence on mu that we accidentally introduced.

EDIT: this is flagrantly wrong, because log(p) depends on the measure also. You’re right that this is really a function of the distribution and the measure; I’m not sure offhand if it’s cross-entropy, either, but I’m going to think about this more. (This is an embarrassing mistake because I already knew differential entropy was cursed with dependence on a measure—quantum mechanics famously provides the measure on phase-space that classical statistical mechanics took as axiomatic.)

For what it’s worth, I’ve heard the take “entropy and differential entropy are different sorts of things” several times; I might be coming around to that, now that I see another slippery slope on the horizon.
I’d be interested in a citation of what you’re referring to here!
The state-space (for particles) in statmech is the space of possible positions and momenta for all particles.
The measure that’s used is uniform over each coordinate of position and momentum, for each particle.
This is pretty obvious and natural, but not forced on us, and:
You get different, incorrect predictions about thermodynamics (!) if you use a different measure.
The level of coarse graining is unknown, so every quantity of entropy has an extra “+ log(# microstates per unit measure)” which is an unknown additive constant. (I think this is separate from the relationship between bits and J/K, which is a multiplicative constant for entropy—k_B—and doesn’t rely on QM afaik.)
On the other hand, Liouville’s theorem gives some pretty strong justification for using this measure, alleviating (1) somewhat:
https://en.wikipedia.org/wiki/Liouville%27s_theorem_(Hamiltonian)
In quantum mechanics, you have discrete energy eigenstates (...in a bound system, there are technicalities here...) and you can define a microstate to be an energy eigenstate, which lets you just count things and not worry about measure. This solves both problems:
Counting microstates and taking the classical limit gives the “dx dp” (aka “dq dp”) measure, ruling out any other measure.
It tells you how big your microstates are in phase space (the answer is related to Planck’s constant, which you’ll note has units of position * momentum).
This section mostly talks about the question of coarse-graining, but you can see that “dx dp” is sort of put in by hand in the classical version: https://en.wikipedia.org/wiki/Entropy_(statistical_thermodynamics)#Counting_of_microstates
I wish I had a better citation but I’m not sure I do.
In general it seems like (2) is talked about more in the literature, even though I think (1) is more interesting. This could be because Liouville’s theorem provides enough justification for most people’s tastes.
Finally, knowing “how big your microstates are” is what tells you where quantum effects kick in. (Or vice versa—Planck estimated the value of the Planck constant by adjusting the spacing of his quantized energy levels until his predictions for blackbody radiation matched the data.)
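To make the coarse-graining constant concrete, here is a small numpy sketch of my own (not from the discussion above): binning the same Gaussian “phase-space” density at two different microstate sizes shifts the discrete entropy by exactly log2 of the ratio of bin widths, i.e. the unknown “+ log(# microstates per unit measure)” term.

```python
import numpy as np

def discrete_entropy_bits(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def binned_gaussian(bin_width, x_max=10.0):
    """Probability mass of a standard Gaussian 'phase-space' density, binned at the given width."""
    edges = np.arange(-x_max, x_max + bin_width, bin_width)
    centers = 0.5 * (edges[:-1] + edges[1:])
    mass = np.exp(-centers**2 / 2) / np.sqrt(2 * np.pi) * bin_width
    return mass / mass.sum()          # renormalize away truncation/discretization error

h_fine, h_coarse = 0.01, 0.02         # two arbitrary choices of "microstate size"
S_fine = discrete_entropy_bits(binned_gaussian(h_fine))
S_coarse = discrete_entropy_bits(binned_gaussian(h_coarse))

print(S_fine - S_coarse)              # ~ 1.0 bit
print(np.log2(h_coarse / h_fine))     # exactly 1.0: the unknown additive constant
```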
:D
(strong upvote for shifting position in realtime, even if by small degrees and towards the opposite side of the fence from me :-p)
(I’m not actually familiar enough w/ statmech to know what measure we use on phase-space, and I’m interested in links that explain what it is, and how it falls out of QM, if you have any handy :D)
I don’t currently have much sympathy for “entropy and differential entropy are different” view, b/c I occasionally have use for non-uniform measures even in the finite case.
Like, maybe I’m working with distributions over 10 states, and I have certain constraints I’m trying to hit, and subject to those constraints I wanna maximize entropy wrt the Binomial distribution as the measure. And you might be like “Well, stop doing that, and start working with distributions on 2^10 states, and convert all your constraints to refer to the bitcount of the new state (instead of the value of the old state). Now you can meet your constraints while maximizing entropy wrt the uniform measure like a normal person.” To which I’d reply (somewhat trollishly) “that rejoinder doesn’t work in the limit, so it must get weaker and weaker as I work with more and more states, and at numbers as large as 10 this rejoinder is so weak as to not compel me.”
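As a toy check of the equivalence that this imagined rejoinder leans on (my own sketch, with arbitrary made-up numbers): the entropy of a distribution over the 11 bitcount-classes, taken with respect to the binomial counting measure, equals the ordinary uniform-measure entropy of the lifted distribution over all 2^10 bitstrings. It says nothing about the trollish point regarding limits; it just shows why the lift preserves the number.

```python
import numpy as np
from math import comb

n = 10
counts = [comb(n, k) for k in range(n + 1)]   # number of 10-bit strings with each bitcount k

rng = np.random.default_rng(0)
p = rng.random(n + 1)
p /= p.sum()                                  # some arbitrary distribution over bitcounts 0..10

# Entropy of p taken with respect to the binomial counting measure m(k) = C(10, k):
H_wrt_binomial = -sum(pk * np.log2(pk / c) for pk, c in zip(p, counts))

# Lift p to the 2**10 bitstrings: spread p(k) uniformly over the C(10, k) strings
# with bitcount k, then take the ordinary uniform-measure Shannon entropy.
lifted = np.concatenate([np.full(c, pk / c) for pk, c in zip(p, counts)])
H_lifted = -np.sum(lifted * np.log2(lifted))

print(H_wrt_binomial, H_lifted)               # equal up to float error
```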
From my perspective, the obvious rejoinder to “entropy is already two-place” is “insofar as entropy is two-place, cross-entropy is three-place!”. Which, ftr, I might find compelling. It depends whether differential cross-entropy needs all three parameters (P, Q, μ), or whether we can combine two of them (like by using (P, μP/Q) or something). Or, at least, that’s what my intuition says off-the-cuff; I’m running on guesswork and a stale cache here :-p.
From my perspective, the obvious rejoinder to “entropy is already two-place” is “insofar as entropy is two-place, cross-entropy is three-place!”.

I think this is roughly where I’m at now.
After thinking a bit and peeking at Wikipedia, the situation seems to be:
The differential entropy of a probability density p is usually defined as
$-\int p(x)\log p(x)\,dx$
This is unfortunate, because it isn’t invariant under coordinate transformations on x. A more principled (e.g. invariant) thing to write down, courtesy of Jaynes, is
$-\int p(x)\log\frac{p(x)}{m(x)}\,dx$
where m=dμ/dx is a density function for some measure μ. We can also write this as
$-\int \log\frac{dP}{d\mu}\,dP$ (Jaynes’ continuous entropy of P with respect to μ)
in terms of a probability measure P with p=dP/dx, which is a bit more clearly invariant.
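Here’s a quick numerical illustration (mine, not Jaynes’): under the rescaling y = 2x, the naive differential entropy shifts by log 2, while the version with an explicit reference density m = dμ/dx, transformed along with p, is unchanged. For convenience the reference measure here happens to be a Gaussian probability measure.

```python
import numpy as np

x = np.linspace(-30.0, 30.0, 200001)
dx = x[1] - x[0]

def gauss(z, sigma):
    return np.exp(-z**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

# Distribution P and reference density m = dmu/dx in the original coordinate x.
p, m = gauss(x, 1.0), gauss(x, 3.0)

# The same P and mu expressed in the transformed coordinate y = 2x:
# densities pick up a 1/2 Jacobian factor, i.e. the sigmas double.
y, dy = 2 * x, 2 * dx
p_y, m_y = gauss(y, 2.0), gauss(y, 6.0)

def naive_entropy(density, step):
    return -np.sum(density * np.log(density)) * step

def jaynes_entropy(density, ref, step):
    return -np.sum(density * np.log(density / ref)) * step

print(naive_entropy(p, dx), naive_entropy(p_y, dy))            # differ by log(2): not invariant
print(jaynes_entropy(p, m, dx), jaynes_entropy(p_y, m_y, dy))  # agree: invariant
```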
Now we can define a cross-entropy-like thing as
$-\int \log\frac{dQ}{d\mu}\,dP$ (continuous cross-entropy of Q under P with respect to μ)
...and a small surprise is coming up. Jumping back to the discrete case, the KL divergence or “relative entropy” is
$D_{\mathrm{KL}}(P\|Q) = \sum_x P(x)\log\frac{P(x)}{Q(x)} = \mathrm{CrossEntropy}(P,Q) - \mathrm{Entropy}(P)$
What happens when we try to write something analogous with our new continuous entropy and crossentropy? We get
$-\int \log\frac{dQ}{d\mu}\,dP + \int \log\frac{dP}{d\mu}\,dP = \int \log\frac{dP}{dQ}\,dP$
which looks like the right generalization of discrete KL divergence, but also happens to be the continuous entropy of P with respect to Q! (The dependence on μ drops out.) This is weird but I think it might be a red herring.
We can recover the usual definitions in the discrete (finite or infinite) case by taking μ to be a uniform measure that assigns 1 to each state (note that this is NOT a probability measure—I never said μ was a probability measure, and I don’t know how to handle the countably infinite case if it is.)
So maybe the nicest single thing to define, for the sake of making everything else a concisely-specified special case, is
$H(P,Q;\mu) = \mathbb{E}_P\!\left[-\log\frac{dQ}{d\mu}\right] = -\int \log\frac{dQ}{d\mu}\,dP$
(“H” chosen because it is used for both entropy and cross-entropy, afaik unlike “S”).
We could take KL divergence as a two-argument atomic thing, or work with −∫log(f)dP for a scalar function f and a distribution P, but I think these both make it cumbersome to talk about the things everyone wants to talk about all the time. I’m open to those possibilities, though.
The weird observation about KL divergence above is a consequence of the identity
H(P,Q;μ)−H(P,R;μ)=H(P,Q;R)
but I think this is a slightly strange fact because R is a probability measure and μ is a general measure. Hmm.
We also have
H(P,Q;μ)=−H(P,μ;Q)
in the case that μ is a probability measure, but not otherwise.
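A minimal finite-state sketch of H(P, Q; μ) in the notation above (my code, not anyone’s canonical implementation): entropy, cross-entropy, and KL divergence fall out as special cases with the counting measure U, the μ-dependence cancels out of “cross-entropy minus entropy”, and the two identities just stated check out numerically.

```python
import numpy as np

def H(P, Q, mu):
    """H(P, Q; mu) = -sum_x P(x) log2(Q(x)/mu(x)) on a finite state space (in bits)."""
    P, Q, mu = (np.asarray(a, dtype=float) for a in (P, Q, mu))
    return -np.sum(P * np.log2(Q / mu))

rng = np.random.default_rng(0)
P, Q, R = (rng.random(6) for _ in range(3))
P, Q, R = P / P.sum(), Q / Q.sum(), R / R.sum()   # three probability distributions
U = np.ones(6)        # counting measure: 1 per state (not a probability measure)
mu = rng.random(6)    # some arbitrary non-uniform base measure

print(H(P, P, U))                              # entropy of P
print(H(P, Q, U))                              # cross-entropy of Q relative to P
print(H(P, Q, P))                              # KL divergence from Q to P
print(H(P, Q, mu) - H(P, P, mu))               # same KL: the mu-dependence drops out
print(H(P, Q, mu) - H(P, R, mu), H(P, Q, R))   # H(P,Q;mu) - H(P,R;mu) = H(P,Q;R)
print(H(P, Q, R), -H(P, R, Q))                 # equal, since R is a probability measure
```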
And I can rattle off some claims about special cases in different contexts:
A (the?) fundamental theorem of entropy-like stuff is H(P,Q;μ)≥H(P,P;μ), or “crossentropy >= entropy”.
In the discrete case, practitioners of information theory and statistical physics tend to assume μ is uniform with measure 1 on each state, and (almost?) never discuss it directly. I’ll call this measure U.
They reserve the phrase “entropy of P” (in the discrete case) for H(P, P; U), and this turns out to be an extremely useful quantity.
They use “cross-entropy of Q relative to P” to mean H(P, Q; U)
They use “KL divergence from Q to P” (or “relative entropy …”) to mean H(P, Q; P) = -H(P, P; Q) = H(P, Q; U) - H(P, P; U)
For a state s, let’s define delta_s to be the distribution that puts measure 1 on s and 0 everywhere else.
For a code C, let’s define f_C to be the “coin flip” distribution that puts probability 2^-len(C(s)) on state s (or the sum of this over all codewords for s, if there are several). (Problem: what if C doesn’t assign all its available strings, so that this isn’t a probability distribution? idk.)
(see also https://en.wikipedia.org/wiki/Cross_entropy#Motivation)
If a state s has only one codeword C(s), the length of that codeword is len(C(s)) = H(delta_s, f_C; U)
Note that this is not H(delta_s, delta_s; U), the entropy of delta_s, which is always zero. This is a lot of why I don’t like taking “entropy of s” to mean H(delta_s, f_C; U).
Information theorists use “surprisal of s under P” (or “self-information …”) to mean H(delta_s, P; U)
The expected length of a code C, for a source distribution P, is H(P, f_C; U)
Consequence from fundamental thm: a code is optimal iff f_C = P; the minimal expected length of the code is the entropy of P
If you use the code C when you should’ve used D, your expected length is H(f_D, f_C; U) and you have foolishly wasted H(f_D, f_C; f_D) bits in expectation.
Physicists use “entropy of a macrostate” to mean things like H(p_A, p_A; U) where p_A is a uniform probability measure over a subset A of the state space, and possibly also for other things of the form H(P, P; U) where P is some probability measure that corresponds to your beliefs if you only know macroscopic things.
Kolmogorov complexity (“algorithmic entropy”) of a state s is, as you argue, maybe best seen as an approximation of H(delta_s, p_Solomonoff; U). I don’t see a good way of justifying the name by analogy to any “entropy” in any other context, but maybe I’m missing something.
This makes that one quote from Wiktionary into something of the form: “H(P, p_Solomonoff; U) ~= H(P, P; U) whenever P is computed by a short program”. This makes sense for some sense of “~=” that I haven’t thought through.
When working in continuous spaces like R, people typically define “[differential] entropy of P” as H(P, P; L) where L is the Lebesgue measure (including the uniform measure on R). If they are working with coordinate transformations from R to R, they implicitly treat one variable as the “true” one which the measure should be uniform on, often without saying so.
They define cross-entropy as H(P, Q; L)....
...and KL divergence as H(P, Q; L) - H(P, P; L) = H(P, Q; P) (same as discrete case!)
But sometimes people use Jaynes’ definitions, which replace L with an arbitrary (and explicitly-specified) measure, like we’re doing here. In that case, “entropy” is H(P, P; mu) for an arbitrary mu.
The properness of the logarithmic scoring rule also follows from the fundamental theorem. Your (subjective) expected score is H(your_beliefs, your_stated_beliefs; U) and this is minimized for your_stated_beliefs = your_beliefs.
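A small self-contained sketch (mine) of several of the coding bullets above: a prefix-free code C is represented just by its codeword lengths, f_C is its “coin flip” distribution, the expected code length is H(P, f_C; U), the code is optimal exactly when f_C = P, and the bits wasted by a mismatched source show up as a KL divergence.

```python
import numpy as np

def H(P, Q, mu):
    """H(P, Q; mu) = -sum_x P(x) log2(Q(x)/mu(x)) on a finite state space (in bits)."""
    return -np.sum(np.asarray(P) * np.log2(np.asarray(Q) / np.asarray(mu)))

# Source distribution over four states, and a prefix-free code C given by its
# codeword lengths (1, 2, 3, 3) -- e.g. the codewords 0, 10, 110, 111.
P = np.array([0.5, 0.25, 0.125, 0.125])
lengths = np.array([1, 2, 3, 3])
f_C = 2.0 ** -lengths            # the "coin flip" distribution of the code
U = np.ones(4)                   # counting measure

print(np.sum(P * lengths), H(P, f_C, U))   # expected code length = H(P, f_C; U)
print(H(P, P, U))                          # = entropy of P: this code is optimal (f_C == P)

# A mismatched source D: using C when the source is actually D.
D = np.array([0.1, 0.2, 0.3, 0.4])
print(H(D, f_C, U))                              # expected length under D
print(H(D, f_C, U) - H(D, D, U), H(D, f_C, D))   # wasted bits = KL divergence, two ways
```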
I wanted to let that comment be about the interesting question of how we unify these various things.
But on the ongoing topic of “why not call all this entropy, if it’s all clearly part of the same pattern?”:
When the definition of some F(x) refers to x twice, it’s often useful to replace one of them with y and call that G(x, y). But it’s usually not good for communication to choose a name for G(x, y) that (almost) everyone else uses exclusively for F(x), especially if you aren’t going to mention both x and y every time you use it, and doubly especially if G is already popular enough to have lots of names of its own (you might hate those names, but get your own[1]).
e.g.: x*y is not “the square of x and y” much less “the square of x [and y is implied from context]”, and the dot product v.w is not “the norm-squared of v and w” etc.
[1] might I suggest “xentropy”?
:D
(strong upvote for delicious technical content)
(also fyi, the markdown syntax for footnotes is like
blah blah blah[^1]
and then somewhere else in the file, on a newline, [^1]: content of the footnote)

This updates me a fair bit towards the view that we should keep entropy and cross-entropy separate.
The remaining crux for me is whether the info theory folks are already using “entropy” to mean “the minimum expected surprise you can achieve by choosing a code from this here space of preferred codes” (as per your “wiggle room” above), in which case my inclination is to instead cackle madly while racing my truck through their wiggle-room, b/c that’s clearly a cross-entropy.
At a glance through wikipedia, it doesn’t look like they’re doing that, though, so I retreat.
But, hmm, I’m not sure I retreat all the way to having separate words.
I’m persuaded that “the square of x” should absolutely not mean “x * y [where y is implied from context]”, and I no longer think that “entropy” should mean the cross-entropy with Q implied from context. (Thanks for the distillation of why it’s mad!)
But, like, if geometrists instead invented a two-place operation rect, with a general form rect(x, y) := x * y, and declared that rect(x) was shorthand for rect(x, x), then I would not protest; this seems like a reasonable and unambiguous way to reduce the number of names floating around.[1]

And this seems to me like exactly what the information theorists (by contrast to the physicists) are doing with H (by contrast to S)!
Like, the notations H(P) and H(P,Q) are just begging for us to pronounce H the same way in each case; we’re not tempted to pronounce the former as “eych” and the latter as “cross-eych”. And no ambiguity arises, so long as we use the rule that if Q is left out then we mean P.
Thus, I propose the following aggressive nomenclature:
the “P-entropy of Q (wrt μ)” aka Hμ(P,Q) is the general form
the “entropy of P (wrt μ)” aka Hμ(P) is a shorthand for “the P-entropy of P (wrt μ)”
μ may be omitted when the canonical uniform measure is clear from context
With the following bold extension:
the “Q-complexity of P (wrt μ)” aka Kμ(Q,P) is another name for Hμ(P,Q)
the “complexity of P (wrt μ)” aka Kμ(P) is used when Q can be inferred from context
we cast codes/UTMs/points/sets to distributions in natural ways
Thus, when we have a canonical universal Turing machine U on hand, “the complexity of f” aka K(f) means “the Solomonoff(U)-complexity of δ(f)” aka “the δ(f)-entropy of Solomonoff(U)”.
(I’m not wedded to the word “complexity” or the symbol “K” there. What I want is terms and symbols that let me quickly say something that means the ləg-sum-əxp of the code-lengths of all codes for f in the implicit coding scheme.)
(One reason I like the pronunciation “P-entropy of Q” is that saying “P-entropy” feels intuitively like we’re obviously weighting by P (perhaps b/c of the analogy to “P-expectation”), and so I expect it to help me remember which distribution is which in H(P, Q).)
As a fallback, I think ‘xentropy’ is decent, but my guess is that ‘crossent’ is better. It has fewer syllables, it’s already common, and it’s more obvious what it means.
(If I was using ‘crossent’ a lot I might start using ‘selfent’ to mean the entropy, b/c of the symmetry and the saved syllable. Three syllables for ‘entropy’ is kinda a lot!)
That said, I currently personally prefer using the same word for both in this case, and note that it’s very heavily precedented by info theory’s H.
Musing aloud for my own sake: it’s not in general bad to say “the general form is F(x,y), but when we say F(x) that means to infer y from context”. We do that even here, with the μ in entropy! The thing that’s bad is when F(x,x) is a very common case. In particular, the rule “F(x) means F(x, figure_it_out)” and the rule “F(x) means F(x, x)” are in conflict, and you can only have one of those active at a time. And in the case of entropy, the H(P,P) rule has a much, much stronger claim to the throne. So we concede that H(P) means H(P,P) and not H(P, figure_it_out), but without conceding that you can’t invent other notation that says Q should be inferred from context.
I agree with all the claims in this comment and I rather like your naming suggestions! Especially the “P-entropy of Q = Q-complexity of P” trick which seems to handle many use cases nicely.
(So the word “entropy” wasn’t really my crux? Maybe not!)
Convergence! 🎉🎉. Thanks for the discussion; I think we wound up somewhere cooler than I would have gotten to on my own.
Now we just need to convince Alex :-p. Alex, are you still reading? I’m curious for your takes on our glorious proposal.
Heh, I’m still skimming enough to catch this, but definitely not evaluating arguments.
I’m definitely still open to both changing my mind about the best use of terms and also updating the terminology in the sequence (although I suspect that will be quite a non-trivial amount of modified prose). And I think it’s best if I don’t actually think about it until after I publish another post.
I’d also be much more inclined to think harder about this discussion if there were more than two people involved.
My main goal here has always been “clearly explain the existing content so that people can understand it”, which is very different from “propose a unification of the whole field” (it’s just that “unification” is my native method of understanding).
Cool cool. A summary of the claims that feel most important to me (for your convenience, and b/c I’m having fun):
K-complexity / “algorithmic entropy” is a bad concept that doesn’t cleanly relate to physics!entropy or info!entropy.
In particular, the K-complexity of a state s is just the length of the shortest code for s, and this is bad because when s has multiple codes it should count as “simpler”. (A state with five 3-bit codes is simpler than a state with one 2-bit code.) (Which is why symmetry makes a concept simpler despite not making its code shorter.)
If we correct our notion of “complexity” to take multiple codes into account, then we find that complexity of a state s (with respect to a coding scheme C) is just the info!cross-entropy H(s,C). Yay!
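A toy numeric check of the five-3-bit-codes vs one-2-bit-code example above (my own sketch; corrected_complexity is just a placeholder name for the ləg-sum-əxp / Nate!complexity notion):

```python
import numpy as np

def k_complexity(code_lengths):
    """Length of the shortest code: the K-complexity-style notion."""
    return min(code_lengths)

def corrected_complexity(code_lengths):
    """-log2 of the total implied probability mass, i.e. the log-sum-exp over all codes."""
    return -np.log2(np.sum(2.0 ** -np.asarray(code_lengths, dtype=float)))

state_A = [3, 3, 3, 3, 3]   # five 3-bit codes
state_B = [2]               # one 2-bit code

print(k_complexity(state_A), k_complexity(state_B))                  # 3 vs 2: K says B is simpler
print(corrected_complexity(state_A), corrected_complexity(state_B))  # ~0.68 vs 2: A is simpler
```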
Separately, some gripes:
the algorithmic information theory concept is knuckleheaded, and only approximates info!entropy if you squint really hard, and I’m annoyed about it
I suspect that a bunch of the annoying theorems in algorithmic information theory are annoying precisely because of all the squinting you have to do to pretend that K-complexity was a good idea
And some pedagogical notes:
I’m all for descriptive accounts of who uses “entropy” for what, but it’s kinda a subtle situation because:
info!entropy is a very general concept,
physics!entropy is an interesting special case of that concept (in the case where the state is a particular breed of physical macrostate),
algo!entropy is a derpy mistake that’s sniffing glue in the corner,
algo!entropy is sitting right next to a heretofore unnamed concept that is another interesting special case of info!(cross-)entropy (in the case where the code is universal).
(oh and there’s a bonus subtlety that if you port algo!entropy to a situation where the coding schema has at most one code per state—which is emphatically not the case in algorithmic information theory—then in that limited case it agrees with info!cross-entropy. Which means that the derpiness of algo!entropy is just barely not relevant to this particular post.)
So, uh, good luck.
And separately, the key points about notation (while I’m summarizing, acknowledging that the notation might not change in your coming posts):
The corrected notion of the “complexity” of a state s given a coding scheme C is H(s,C)
It’s confusing to call this the “entropy” of s, because that sounds like H(s), which by long-established convenient convention means H(s,s).
An elegant solution is to just call this the “(C-)complexity” of s instead.
This highlights a cool reciprocal relationship between (cross-)entropy and complexity, where the s-entropy of C is the same as the C-complexity of s. This can perhaps be mined for intuition.
Endorsed.