One wrinkle is that entropy isn’t quite minimum average description length—in general it’s a lower bound on average description length.
If you have a probability distribution that’s (2/3, 1⁄3) over two things, but you assign fixed binary strings to each of the two, then you can’t do better than 1 bit of average description length, but the entropy of the distribution is 0.92 bits.
Or if your distribution is roughly (.1135, .1135, .7729) over three things, then you can’t do better than 1.23 bits, but the entropy is 1 bit.
You can only hit the entropy exactly when the probabilities are all powers of 2.
(You can fix this a bit in the channel-coding context, where you’re encoding sequences of things and don’t have to assign fixed descriptions to individual things. In particular, you can assign descriptions to blocks of N things, which lets you get arbitrarily close as N → infinity.)
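(A quick numerical check of those numbers, sketched in Python as an illustration; the Huffman construction is standard, and `math.prod` assumes Python 3.8+.)

```python
import heapq
import itertools
import math
from math import log2

def entropy(p):
    """Shannon entropy of a distribution, in bits."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def huffman_expected_length(p):
    """Expected codeword length of an optimal prefix-free (Huffman) code for p."""
    heap = list(p)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        total += a + b  # each merge adds one bit to every symbol beneath it
        heapq.heappush(heap, a + b)
    return total

p = (2/3, 1/3)
print(entropy(p), huffman_expected_length(p))  # ~0.918 bits vs 1.0 bit

q = (0.1135, 0.1135, 0.7729)
print(entropy(q), huffman_expected_length(q))  # ~1.000 bits vs ~1.227 bits

# Block coding: encode N symbols jointly; the per-symbol cost creeps down toward the entropy.
for N in (1, 2, 4, 8):
    blocks = [math.prod(c) for c in itertools.product(p, repeat=N)]
    print(N, huffman_expected_length(blocks) / N)  # 1.0, ~0.944, ... approaching ~0.918
```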
I think you can bring the two notions into harmony by allowing multiple codes per state (with the entropy/description-length of a state being the negative log of the fraction of the codespace that codes for that state).
For instance, you can think of a prefix-free code as a particularly well-behaved many-to-one assignment of infinite bitstrings to states, with (e.g.) the prefix-free code “0” corresponding to every infinite bitstring that starts with 0 (which is half of all infinite bitstrings, under the uniform measure).
If we consider all many-to-one assignments of infinite bitstrings to states (rather than just the special case of prefix-free codes) then there’ll always be an encoding that matches the entropy, without needing to say stuff like “well our description-length can get closer to the theoretical lower-bound as we imagine sending more and more blocks of independent data and taking the average per-block length”.
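(To make the many-to-one picture concrete, here is a small sketch of my own, not something from the thread, for the (2/3, 1/3) example: give each state an infinite family of prefixes whose total codespace measure equals its probability, and the average of the resulting −log2(fraction) description lengths lands exactly on the entropy.)

```python
from math import log2

def measure(prefixes):
    """Fraction of all infinite bitstrings that begin with one of these prefixes."""
    return sum(2.0 ** -len(w) for w in prefixes)

K = 30  # truncate the infinite prefix families for the demo
A = ["0" * (2 * i) + "1" for i in range(K)]      # "1", "001", "00001", ... -> measure ~2/3
B = ["0" * (2 * i + 1) + "1" for i in range(K)]  # "01", "0001", ...        -> measure ~1/3

p = {"A": 2 / 3, "B": 1 / 3}
desc_len = {"A": -log2(measure(A)), "B": -log2(measure(B))}  # ~0.585 and ~1.585 bits

avg = sum(p[s] * desc_len[s] for s in p)
H = -sum(pi * log2(pi) for pi in p.values())
print(avg, H)  # both ~0.918 bits: this encoding matches the entropy exactly
```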
(If you want to keep the codespace finite, we can also see the entropy as the limit of how well we can do as we allow the codespace to increase in size.)
(I suspect that I can also often (always?) match the entropy if you let me design custom codespaces, where I can say stuff like “first we have a bit, and then depending on whether it’s 0 or 1, we follow it up by either a trit or a quadit”.)
(epistemic status: running off of a cache that doesn’t apply cleanly, but it smells right \shrug)
Sure, from one perspective what’s going on here is that we’re being given a distribution p and asked to come up with a distribution q such that
CrossEntropy(p, q) = E_p[-log q]
is as small as possible. And then a bit of calculus shows that q=p is optimal, with a minimal value of
Entropy(p) = CrossEntropy(p, p)
If we’re happy to call -log q “description length” right off the bat, we can let q be a distribution over the set of infinite bit strings, or the set of finite simple graphs, or over any (infinite) set we like.
But some settings are special, such as “q has to be the coin-flip distribution over a prefix-free code”, because in those settings our quantity -log q is forced to equal the length of something in the normal sense of “length of something”.
So the gap I’m interested in closing is between things that have actual lengths and things that are exactly equal to entropy, and the block encoding thing is the simplest way I know to do that.
I think using the coin-flip distribution over infinite strings is nice because it hits entropy exactly and has a clear relationship with the prefix-free-code case, but the block code motivates “length” better in isolation.
What’s your take on using “description length” for the length of a single description of a state, and “entropy” for the log-sum-exp of the description-lengths of all names for the state? (Or, well, ləg-sum-əxp, if you wanna avoid a buncha negations.)
I like it in part because the ləg-sum-əxp of all description-lengths seems to me like a better concept than K-complexity anyway. (They’ll often be similar, b/c ləg-sum-əxp is kinda softminish and the gap between description-lengths is often long, but when they differ it’s the ləg-sum-əxp’d thing that you usually want.)
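(A tiny sketch of the difference, using made-up lengths plus the "five 3-bit codes vs. one 2-bit code" case that comes up again further down:)

```python
from math import log2

def leg_sum_exp(lengths):
    """-log2(sum_i 2^-len_i): a complexity that counts *all* descriptions of a state."""
    return -log2(sum(2.0 ** -l for l in lengths))

def k_complexity(lengths):
    """Length of the shortest single description."""
    return min(lengths)

print(k_complexity([2]), leg_sum_exp([2]))          # 2, 2.0
print(k_complexity([3] * 5), leg_sum_exp([3] * 5))  # 3, ~0.678 (simpler despite longer codes)

# When one description is far shorter than the rest, the two notions nearly agree:
print(k_complexity([10, 50, 60]), leg_sum_exp([10, 50, 60]))  # 10, ~10.0 (softmin-ish)
```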
For example, Solomonoff induction does not have the highest share of probability-mass on the lowest K-complexity hypothesis among those consistent with the data. It has the highest share of probability-mass on the hypothesis with lowest ləg-sum-əxp of all description-lengths among those consistent with the data.
This can matter sometimes. For instance, in physics we can’t always fix the gauge. Which means that any particular full description of physics needs to choose a full-fledged gauge, which spends an enormous amount of description-length. But this doesn’t count against physics, b/c for every possible gauge we could describe, there are (same-length) ways of filling out the rest of the program such that it gives the right predictions. In the version of Solomonoff induction where hypotheses are deterministic programs, physics does not correspond to a short program; it corresponds to an enormous number of long programs. With the number so enormous that the ləg-sum-əxp of all those big lengths is small.
More generally, this is related to the way that symmetry makes things simpler. If your code has a symmetry in it, that doesn’t make your program any shorter, but it does make the function/hypothesis your program represents simpler, not in terms of K-complexity but in terms of “entropy” (b/c, if S is the symmetry group, then there are |S|-many programs of the ~same length that represent it, which decreases the ləg-sum-əxp by ~log(|S|)).
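(Spelling out the arithmetic behind that last claim, assuming the |S| symmetry-partner programs all have roughly the same length L:)

$$-\log_2\Big(\sum_{i=1}^{|S|} 2^{-L}\Big) \;=\; -\log_2\big(|S| \cdot 2^{-L}\big) \;=\; L - \log_2 |S|.$$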
(Ofc, if we start thinking of hypotheses as being probabilistic programs instead, then we can unify all the symmetry-partners into a single probabilistic program that samples a gauge / element of the symmetry group. So K-complexity is often pretty close to the right concept, if you pick the right formalism. But even then I still think that the ləg-sum-əxp of all description-lengths is even more precisely what you want, e.g. when there’s multiple similar-length probabilistic programs that do the same thing.)
I’m not up to speed on the theorems that approximately relate “algorithmic entropy” to normal entropy, but I suspect that the conditions of those theorems are doing some hand-waving of the form “as the gap between description-lengths gets longer” (perhaps by iterating something that artificially inflates the gap and taking averages). If so, then I expect the ləg-sum-əxp of all description lengths to satisfy these theorems immediately and without the handwaving.
I have now personally updated towards thinking that calling the K-complexity the “algorithmic entropy” is simply wrong. Close, but wrong. The thing deserving that name is the ləg-sum-əxp of all the description-lengths.
(Indeed, the whole concept of K-complexity seems close-but-wrong to me, and I’ve personally been using the ləg-sum-əxp concept in its place for a while now. I’ve been calling it just “complexity” in my own head, though, and hadn’t noticed that it might be worthy of the name “entropy”.)
epistemic status: spitballing; still working like 50% from cache
I still don’t like that, because this whole subthread is kind of orthogonal to my concerns about the word “entropy”.
This subthread is mostly about resolving the differences between a code (assignment of one or more codewords to one or more states) and a probability distribution. I think we’ve made progress on that and your latest comment is useful on that front.
But my concerns about “entropy” are of the form: “I notice that there’s a whole field of coding theory where ‘entropy’ means a particular function of a probability distribution, rather than a function of an individual state. This is also consistent with how physicists and other kinds of computer scientists use the word, except for the phrase ‘algorithmic entropy’. I think we should not break compatibility with this usage.”
Ignoring the differences between distributions and codes, I’d be fine with assigning “entropy” to various things shaped like either “sum of p(state) · (−log p(state)) across all states” for a distribution or “sum of (2^-len(state)) · len(state) across all states” for a code.
I am also fine with assigning it to “sum of p(state) len(state)” for a (distribution, code) pair that are matched in an appropriate sense—the distribution is the coinflip distribution for the code, or the code is optimal for the distribution, or something else roughly equivalent.
Elsewhere Alex and I have been referring to this as a pair of qualifiers “average” (i.e. it’s a sum over all states weighted by p or 2^-len) and “minimal” (i.e. the two factors in the sum are for matching or identical codes/distributions).
“Average” distinguishes entropy from the things information theorists call “length” or “self-information” or “surprisal” or just “[negative] log-prob”, and “minimal” distinguishes entropy from “expected length [of an arbitrary code]” or “cross-entropy”.
Cool thanks. I’m hearing you as saying “I want to reserve ‘entropy’ for the case where we’re weighting the length-like thingies by probability-like thingies”, which seems reasonable to me.
I’m not sure I follow the part about matched (distribution, code) pairs. To check my understanding: for a sufficiently forgiving notion of “matching”, this is basically going to yield the cross-entropy, right? Where, IIUC, we’ve lifted the code to a distribution in some natural way (essentially using a uniform distribution, though there might be a translation step like translating prefix-free codes to sets of infinite bitstrings), and then once we have two distributions we take the cross-entropy.
(One of my hypotheses for what you’re saying is “when the distribution and the code are both clear from context, we can shorten ‘cross-entropy’ to ‘entropy’. Which, ftr, seems reasonable to me.)
My own proclivities would say: if I specify only a state and a code, then the state lifts to a distribution by Kronecker’s delta and the code lifts to a distribution uniformly, and I arbitrarily declare these to ‘match’, and so when we speak of the (cross-)entropy of a state given a code we mean the length of the code(s) for that particular state (combined by ləg-sum-əxp if there’s multiple).
This seems like the natural way to ‘match’ a state and a code, to my eye. But I acknowledge that what counts as ‘matching’ is a matter of intuition and convention, and that others’ may differ from mine.
At this point, though, the outcome I’m most invested in is emerging with a short name for “the ləg-sum-əxp of the lengths of all the descriptions”. I’m fine with naming it some variation on ‘complexity’, though. (Kolmogorov kindly left a K in K-complexity, so there’s ample room to pick another letter if we have to.)
(Though to be very explicit about my personal preferences, I’d use “entropy”. It seems to me that once we’ve conceded that we can talk about the entropy of a (distribution, code) pair then we might as well extend the notation to include (state, code) pairs; they’re sibling shorthands. Or, to say it more trollishly: the Nate!complexity of a function is a length-like thingy weighted by a probability-like thingy, where the probability-like thingy is 1 :-p.)
(One of my hypotheses for what you’re saying is “when the distribution and the code are both clear from context, we can shorten ‘cross-entropy’ to ‘entropy’. Which, ftr, seems reasonable to me.)
I want something much more demanding—I want the distribution and code to be “the same” (related by p = 2^-len), or something “as close as possible” to that.
I was leaving a little bit of wiggle room to possibly include “a code matches a distribution if it is the optimal code of its type for compression under that source distribution”, but this is only supposed to allow rounding errors; it seems sort of okay to say that the expected length of (0, 10, 11) under the distribution (0.4, 0.3, 0.3) is some (not quite standard) sort of entropy for that distribution, but not okay to say that the expected length of (0, 10, 11) under (0., 0., 1.) is an entropy.
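(Putting numbers on those two cases, a quick check using the usual expected-length and entropy formulas:)

```python
from math import log2

def entropy(p):
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def expected_length(p, lens):
    return sum(pi * li for pi, li in zip(p, lens))

lens = (1, 2, 2)  # the code (0, 10, 11)
print(expected_length((0.4, 0.3, 0.3), lens), entropy((0.4, 0.3, 0.3)))  # 1.6 vs ~1.571 bits
print(expected_length((0.0, 0.0, 1.0), lens), entropy((0.0, 0.0, 1.0)))  # 2.0 vs 0.0 bits
```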
But I’m on the fence about even giving that much wiggle room.
That’s the only reason I exclude single states. I agree that the length of a state is a kind of cross-entropy, because you can choose a delta distribution, but I draw a firm line between cross-entropy and entropy.
(Obviously there’s a special case, where a code that has a single empty codeword for a single state matches a delta distribution. But not if the codeword isn’t the empty string.)
I wonder if it would be reasonable to use “xentropy” for the broad sense of “entropy” in OP, with the understanding that xentropy is always a two-argument function.
“The length of a codeword is the xentropy between [the delta distribution located at] the state and [the coinflip distribution implied by] the code”
Cool cool. I can personally see the appeal of reserving ‘entropy’ for the case where the distribution and the (natural lifting of) the code (to a distribution) are identical, i.e. your proposal without the wiggle-room.
I don’t yet personally see a boundary between the wiggle-room you’re considering and the full-on “we can say ‘entropy’ as a shorthand for ‘cross-entropy’ when the second distribution is clear from context” proposal.
In particular, I currently suspect that there’s enough wiggle-room in “optimal code of its type for compression under the source distribution” to drive a truck through. Like, if we start out with a uniform code C and a state s, why not say that the “type of codes” for the source distribution δ(s) is the powerset of {c ∈ C | c codes for s}? In which case the “optimal code for compression” is the set of all such c, and the ‘entropy’ is the Nate!complexity?
I’m not yet sure whether our different aesthetics here are due to:
me failing to see a natural boundary that you’re pointing to
you not having yet seen how slippery the slope is
you having a higher tolerance for saying “humans sometimes just wanna put fences halfway down the slippery slope, dude”.
Insofar as you think I’m making the mistake of (1), I’m interested to hear arguments. My argument above is ofc tuned to case (2), and it’s plausible to me that it pushes you off the fence towards “no wiggle room”.
Another place we might aesthetically differ is that I’m much happier blurring the line between entropy and cross-entropy.
One handwavy argument for blurring the line (which has the epistemic status: regurgitating from a related cache that doesn’t cleanly apply) is that if the statespace is uncountably infinite then we need a measure in order to talk about entropy (and make everything work out nicely under change-of-variables). And so in the general case, entropy is already a two-place predicate function involving a distribution and some sort of measure. (...Although my cache is unclear on whether this generalization yields the KL divergence or the cross-entropy. Probably KL. But, like, if we’re holding the correct distribution fixed, then they differ only by a constant. Hopefully. This is as far as I can go w/out warming up that cache, and I don’t wanna warm it up right now \shrug.)
My argument above is ofc tuned to case (2), and it’s plausible to me that it pushes you off the fence towards “no wiggle room”.
Yup, I think I am happy to abandon the wiggle room at this point, for this reason.
if the statespace is uncountably infinite then we need a measure in order to talk about entropy (and make everything work out nicely under change-of-variables). And so in the general case, entropy is already a two-place predicate involving a distribution and some sort of measure.
I think my preferred approach to this is that the density p(x) is not really the fundamental object, and should be thought of as dP/dmu(x), with the measure in the denominator. We multiply by dmu(x) in the integral for entropy in order to remove this dependence on mu that we accidentally introduced. EDIT: this is flagrantly wrong because log(p) depends on the measure also. You’re right that this is really a function of the distribution and the measure; I’m not sure offhand if it’s crossentropy, either, but I’m going to think about this more. (This is an embarrassing mistake because I already knew differential entropy was cursed with dependence on a measure—quantum mechanics famously provides the measure on phase-space that classical statistical mechanics took as axiomatic.)
For what it’s worth, I’ve heard the take “entropy and differential entropy are different sorts of things” several times; I might be coming around to that, now that I see another slippery slope on the horizon.
I’d be interested in a citation of what you’re referring to here!
The state-space (for particles) in statmech is the space of possible positions and momenta for all particles.
The measure that’s used is uniform over each coordinate of position and momentum, for each particle.
This is pretty obvious and natural, but not forced on us, and:
You get different, incorrect predictions about thermodynamics (!) if you use a different measure.
The level of coarse graining is unknown, so every quantity of entropy has an extra “+ log(# microstates per unit measure)” which is an unknown additive constant. (I think this is separate from the relationship between bits and J/K, which is a multiplicative constant for entropy—k_B—and doesn’t rely on QM afaik.)
On the other hand, Liouville’s theorem gives some pretty strong justification for using this measure, alleviating (1) somewhat:
https://en.wikipedia.org/wiki/Liouville%27s_theorem_(Hamiltonian)
In quantum mechanics, you have discrete energy eigenstates (...in a bound system, there are technicalities here...) and you can define a microstate to be an energy eigenstate, which lets you just count things and not worry about measure. This solves both problems:
Counting microstates and taking the classical limit gives the “dx dp” (aka “dq dp”) measure, ruling out any other measure.
It tells you how big your microstates are in phase space (the answer is related to Planck’s constant, which you’ll note has units of position * momentum).
This section mostly talks about the question of coarse-graining, but you can see that “dx dp” is sort of put in by hand in the classical version: https://en.wikipedia.org/wiki/Entropy_(statistical_thermodynamics)#Counting_of_microstates
I wish I had a better citation but I’m not sure I do.
In general it seems like (2) is talked about more in the literature, even though I think (1) is more interesting. This could be because Liouville’s theorem provides enough justification for most people’s tastes.
Finally, knowing “how big your microstates are” is what tells you where quantum effects kick in. (Or vice versa—Planck estimated the value of the Planck constant by adjusting the spacing of his quantized energy levels until his predictions for blackbody radiation matched the data.)
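(For concreteness, the standard statmech convention being gestured at here, not something stated above: the number of microstates associated with a region of N-particle phase space is taken to be

$$\Omega \;\approx\; \frac{1}{N!\,h^{3N}} \int d^{3N}q\, d^{3N}p,$$

i.e. one microstate per volume h for each position/momentum pair, with the 1/N! thrown in when the particles are indistinguishable.)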
(strong upvote for shifting position in realtime, even if by small degrees and towards the opposite side of the fence from me :-p)
(I’m not actually familiar enough w/ statmech to know what measure we use on phase-space, and I’m interested in links that explain what it is, and how it falls out of QM, if you have any handy :D)
I don’t currently have much sympathy for “entropy and differential entropy are different” view, b/c I occasionally have use for non-uniform measures even in the finite case.
Like, maybe I’m working with distributions over 10 states, and I have certain constraints I’m trying to hit, and subject to those constraints I wanna maximize entropy wrt the Binomial distribution as the measure. And you might be like “Well, stop doing that, and start working with distributions on 2^10 states, and convert all your constraints to refer to the bitcount of the new state (instead of the value of the old state). Now you can meet your constraints while maximizing entropy wrt the uniform measure like a normal person.” To which I’d reply (somewhat trollishly) “that rejoinder doesn’t work in the limit, so it must get weaker and weaker as I work with more and more states, and at numbers as large as 10 this rejoinder is so weak as to not compel me.”
From my perspective, the obvious rejoinder to “entropy is already two-place” is “insofar as entropy is two-place, cross-entropy is three-place!”. Which, ftr, I might find compelling. It depends whether differential cross-entropy needs all three parameters (P, Q, μ), or whether we can combine two of them (like by using (P, μP/Q) or something). Or, at least, that’s what my intuition says off-the-cuff; I’m running on guesswork and a stale cache here :-p.
From my perspective, the obvious rejoinder to “entropy is already two-place” is “insofar as entropy is two-place, cross-entropy is three-place!”.
I think this is roughly where I’m at now.
After thinking a bit and peeking at Wikipedia, the situation seems to be:
The differential entropy of a probability density p is usually defined as
−∫p(x)log(p(x))dx
This is unfortunate, because it isn’t invariant under coordinate transformations on x. A more principled (e.g. invariant) thing to write down, courtesy of Jaynes, is
−∫p(x)log(p(x)/m(x))dx
where m=dμ/dx is a density function for some measure μ. We can also write this as
−∫log(dP/dμ)dP (Jaynes’ continuous entropy of P with respect to μ)
in terms of a probability measure P with p=dP/dx, which is a bit more clearly invariant.
Now we can define a cross-entropy-like thing as
−∫log(dQ/dμ)dP (continuous cross-entropy of Q under P with respect to μ)
...and a small surprise is coming up. Jumping back to the discrete case, the KL divergence or “relative entropy” is
D(P||Q) = ∑_x P(x) log(P(x)/Q(x)) = CrossEntropy(P,Q) − Entropy(P)
What happens when we try to write something analogous with our new continuous entropy and crossentropy? We get
−∫log(dQ/dμ)dP + ∫log(dP/dμ)dP = ∫log(dP/dQ)dP
which looks like the right generalization of discrete KL divergence, but also happens to be the continuous entropy of P with respect to Q! (The dependence on μ drops out.) This is weird but I think it might be a red herring.
We can recover the usual definitions in the discrete (finite or infinite) case by taking μ to be a uniform measure that assigns 1 to each state (note that this is NOT a probability measure—I never said μ was a probability measure, and I don’t know how to handle the countably infinite case if it is.)
So maybe the nicest single thing to define, for the sake of making everything else a concisely-specified special case, is
H(P,Q;μ) = E_P[−log(dQ/dμ)] = −∫log(dQ/dμ)dP
(“H” chosen because it is used for both entropy and cross-entropy, afaik unlike “S”).
We could take KL divergence as a two-argument atomic thing, or work with −∫log(f)dP for a scalar function f and a distribution P, but I think these both make it cumbersome to talk about the things everyone wants to talk about all the time. I’m open to those possibilities, though.
The weird observation about KL divergence above is a consequence of the identity
H(P,Q;μ)−H(P,R;μ)=H(P,Q;R)
but I think this is a slightly strange fact because R is a probability measure and μ is a general measure. Hmm.
We also have
H(P,Q;μ)=−H(P,μ;Q)
in the case that μ is a probability measure, but not otherwise.
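(A quick numerical sanity check of these identities in the discrete case; my own sketch, where `H` is just the finite-sum version of the definition above.)

```python
import random
from math import log2

def H(P, Q, mu):
    """Discrete H(P, Q; mu) = -sum_s P(s) * log2(Q(s) / mu(s))."""
    return -sum(P[s] * log2(Q[s] / mu[s]) for s in P if P[s] > 0)

states = range(4)

def random_dist():
    w = [random.random() for _ in states]
    return {s: wi / sum(w) for s, wi in zip(states, w)}

P, Q, R, M = (random_dist() for _ in range(4))
U = {s: 1.0 for s in states}  # uniform measure with mass 1 per state (not a probability measure)

assert abs(H(P, Q, U) - H(P, R, U) - H(P, Q, R)) < 1e-9  # H(P,Q;mu) - H(P,R;mu) = H(P,Q;R)
assert abs(H(P, Q, M) + H(P, M, Q)) < 1e-9               # H(P,Q;mu) = -H(P,mu;Q) for probability mu
assert H(P, Q, U) >= H(P, P, U) - 1e-12                  # cross-entropy >= entropy
```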
And I can rattle off some claims about special cases in different contexts:
A (the?) fundamental theorem of entropy-like stuff is H(P,Q;μ)≥H(P,P;μ), or “crossentropy >= entropy”.
In the discrete case, practitioners of information theory and statistical physics tend to assume μ is uniform with measure 1 on each state, and (almost?) never discuss it directly. I’ll call this measure U.
They reserve the phrase “entropy of P” (in the discrete case) for H(P, P; U), and this turns out to be an extremely useful quantity.
They use “cross-entropy of Q relative to P” to mean H(P, Q; U)
They use “KL divergence from Q to P” (or “relative entropy …”) to mean H(P, Q; P) = -H(P, P; Q) = H(P, Q; U) - H(P, P; U)
For a state s, let’s define delta_s to be the distribution that puts measure 1 on s and 0 everywhere else.
For a code C, let’s define f_C to be the “coin flip” distribution that puts probability 2^-len(C(s)) on state s (or the sum of this over all codewords for s, if there are several). (Problem: what if C doesn’t assign all its available strings, so that this isn’t a probability distribution? idk.)
(see also https://en.wikipedia.org/wiki/Cross_entropy#Motivation)
If a state s has only one codeword C(s), the length of that codeword is len(C(s)) = H(delta_s, f_C; U)
Note that this is not H(delta_s, delta_s; U), the entropy of delta_s, which is always zero. This is a lot of why I don’t like taking “entropy of s” to mean H(delta_s, f_C; U).
Information theorists use “surprisal of s under P” (or “self-information …”) to mean H(delta_s, P; U)
The expected length of a code C, for a source distribution P, is H(P, f_C; U)
Consequence from fundamental thm: a code is optimal iff f_C = P; the minimal expected length of the code is the entropy of P
If you use the code C when you should’ve used D, your expected length is H(f_D, f_C; U) and you have foolishly wasted H(f_D, f_C; f_D) bits in expectation.
Physicists use “entropy of a macrostate” to mean things like H(p_A, p_A; U) where p_A is a uniform probability measure over a subset A of the state space, and possibly also for other things of the form H(P, P; U) where P is some probability measure that corresponds to your beliefs if you only know macroscopic things.
Kolmogorov complexity (“algorithmic entropy”) of a state s is, as you argue, maybe best seen as an approximation of H(delta_s, p_Solomonoff; U). I don’t see a good way of justifying the name by analogy to any “entropy” in any other context, but maybe I’m missing something.
This makes that one quote from Wiktionary into something of the form: “H(P, p_Solomonoff; U) ~= H(P, P; U) whenever P is computed by a short program”. This makes sense for some sense of “~=” that I haven’t thought through.
When working in continuous spaces like R, people typically define “[differential] entropy of P” as H(P, P; L) where L is the Lebesgue measure (including the uniform measure on R). If they are working with coordinate transformations from R to R, they implicitly treat one variable as the “true” one which the measure should be uniform on, often without saying so.
They define cross-entropy as H(P, Q; L)....
...and KL divergence as H(P, Q; L) - H(P, P; L) = H(P, Q; P) (same as discrete case!)
But sometimes people use Jaynes’ definitions, which replace L with an arbitrary (and explicitly-specified) measure, like we’re doing here. In that case, “entropy” is H(P, P; mu) for an arbitrary mu.
The properness of the logarithmic scoring rule also follows from the fundamental theorem. Your (subjective) expected score is H(your_beliefs, your_stated_beliefs; U) and this is minimized for your_stated_beliefs = your_beliefs.
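(A tiny check of the properness claim, my own sketch: the expected surprisal under your true beliefs is minimized by reporting those beliefs.)

```python
from math import log2

def expected_surprisal(beliefs, stated):
    """H(beliefs, stated; U) in bits: your expected log-score if you report `stated`."""
    return -sum(b * log2(s) for b, s in zip(beliefs, stated) if b > 0)

beliefs = (0.7, 0.2, 0.1)
for stated in [(0.7, 0.2, 0.1), (0.6, 0.3, 0.1), (1/3, 1/3, 1/3)]:
    print(stated, round(expected_surprisal(beliefs, stated), 3))
# (0.7, 0.2, 0.1) -> 1.157 (the entropy of your beliefs); any other report scores worse
```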
I wanted to let that comment be about the interesting question of how we unify these various things.
But on the ongoing topic of “why not call all this entropy, if it’s all clearly part of the same pattern?”:
When the definition of some F(x) refers to x twice, it’s often useful to replace one of them with y and call that G(x, y). But it’s usually not good for communication to choose a name for G(x, y) that (almost) everyone else uses exclusively for F(x), especially if you aren’t going to mention both x and y every time you use it, and doubly especially if G is already popular enough to have lots of names of its own (you might hate those names, but get your own[1]).
e.g.: x*y is not “the square of x and y” much less “the square of x [and y is implied from context]”, and the dot product v.w is not “the norm-squared of v and w” etc.
(also fyi, the markdown syntax for footnotes is like blah blah blah[^1] and then somewhere else in the file, on a newline, [^1]: content of the footnote)
This updates me a fair bit towards the view that we should keep entropy and cross-entropy separate.
The remaining crux for me is whether the info theory folks are already using “entropy” to mean “the minimum expected surprise you can achieve by choosing a code from this here space of preferred codes” (as per your “wiggle room” above), in which case my inclination is to instead cackle madly while racing my truck through their wiggle-room, b/c that’s clearly a cross-entropy.
At a glance through wikipedia, it doesn’t look like they’re doing that, though, so I retreat.
But, hmm, I’m not sure I retreat all the way to having separate words.
I’m persuaded that “the square of x” should absolutely not mean “x * y [where y is implied from context]”, and I no longer think that “entropy” should mean the cross-entropy with Q implied from context. (Thanks for the distillation of why it’s mad!)
But, like, if geometers instead invented a two-place operation rect, with a general form rect(x, y) := x * y, and declared that rect(x) was shorthand for rect(x, x), then I would not protest; this seems like a reasonable and unambiguous way to reduce the number of names floating around.[1]
And this seems to me like exactly what the information theorists (by contrast to the physicists) are doing with H (by contrast to S)!
Like, the notations H(P) and H(P,Q) are just begging for us to pronounce H the same way in each case; we’re not tempted to pronounce the former as “eych” and the latter as “cross-eych”. And no ambiguity arises, so long as we use the rule that if Q is left out then we mean P.
Thus, I propose the following aggressive nomenclature:
the “P-entropy of Q (wrt μ)” aka Hμ(P,Q) is the general form
the “entropy of P (wrt μ)” aka Hμ(P) is a shorthand for “the P-entropy of P (wrt μ)”
μ may be omitted when the canonical uniform measure is clear from context
With the following bold extension:
the “Q-complexity of P (wrt μ)” aka Kμ(Q,P) is another name for Hμ(P,Q)
the “complexity of P (wrt μ)” aka Kμ(P) is used when Q can be inferred from context
we cast codes/UTMs/points/sets to distributions in natural ways
Thus, when we have a canonical universal Turing machine U on hand, “the complexity of f” aka K(f) means “the Solomonoff(U)-complexity of δ(f)” aka “the δ(f)-entropy of Solomonoff(U)”.
(I’m not wedded to the word “complexity” or the symbol “K” there. What I want is terms and symbols that let me quickly say something that means the ləg-sum-əxp of the code-lengths of all codes for f in the implicit coding scheme.)
(One reason I like the pronunciation “P-entropy of Q” is that saying “P-entropy” feels intuitively like we’re obvously weighting by P (perhaps b/c of the analogy to “P-expectation”), and so I expect it to help me remember which distribution is which in H(P, Q).)
As a fallback, I think ‘xentropy’ is decent, but my guess is that ‘crossent’ is better. It has fewer syllables, it’s already common, and it’s more obvious what it means.
(If I was using ‘crossent’ a lot I might start using ‘selfent’ to mean the entropy, b/c of the symmetry and the saved syllable. Three syllables for ‘entropy’ is kinda a lot!)
That said, I currently personally prefer using the same word for both in this case, and note that it’s very heavily precedented by info theory’s H.
Musing aloud for my own sake: it’s not in general bad to say “the general form is F(x,y), but when we say F(x) that means to infer y from context”. We do that even here, with the μ in entropy! The thing that’s bad is when F(x,x) is a very common case. In particular, the rule ”F(x) means F(x,figure_it_out)” and the rule ”F(x) means F(x,x)” are in conflict, and you can only have one of those active at a time. And in the case of entropy, the H(P,P) rule has a much, much stronger claim to the throne. So we concede that H(P) means H(P,P) and not H(P,figure_it_out), but without conceding that you can’t invent other notation that says Q should be inferred from context.
I agree with all the claims in this comment and I rather like your naming suggestions! Especially the “P-entropy of Q = Q-complexity of P” trick which seems to handle many use cases nicely.
(So the word “entropy” wasn’t really my crux? Maybe not!)
Heh, I’m still skimming enough to catch this, but definitely not evaluating arguments.
I’m definitely still open to both changing my mind about the best use of terms and also updating the terminology in the sequence (although I suspect that will be quite a non-trivial amount of modified prose). And I think it’s best if I don’t actually think about it until after I publish another post.
I’d also be much more inclined to think harder about this discussion if there were more than two people involved.
My main goal here has always been “clearly explain the existing content so that people can understand it”, which is very different from “propose a unification of the whole field” (it’s just that “unification” is my native method of understanding).
Cool cool. A summary of the claims that feel most important to me (for your convenience, and b/c I’m having fun):
K-complexity / “algorithmic entropy” is a bad concept that doesn’t cleanly relate to physics!entropy or info!entropy.
In particular, the K-complexity of a state s is just the length of the shortest code for s, and this is bad because when s has multiple codes it should count as “simpler”. (A state with five 3-bit codes is simpler than a state with one 2-bit code.) (Which is why symmetry makes a concept simpler despite not making its code shorter.)
If we correct our notion of “complexity” to take multiple codes into account, then we find that complexity of a state s (with respect to a coding scheme C) is just the info!cross-entropy H(s,C). Yay!
Separately, some gripes:
the algorithmic information theory concept is knuckleheaded, and only approximates info!entropy if you squint really hard, and I’m annoyed about it
I suspect that a bunch of the annoying theorems in algorithmic information theory are annoying precisely because of all the squinting you have to do to pretend that K-complexity was a good idea
And some pedagogical notes:
I’m all for descriptive accounts of who uses “entropy” for what, but it’s kinda a subtle situation because:
info!entropy is a very general concept,
physics!entropy is an interesting special case of that concept (in the case where the state is a particular breed of physical macrostate),
algo!entropy is a derpy mistake that’s sniffing glue in the corner,
algo!entropy is sitting right next to a heretofore unnamed concept that is another interesting special case of info!(cross-)entropy (in the case where the code is universal).
(oh and there’s a bonus subtlety that if you port algo!entropy to a situation where the coding schema has at most one code per state—which is emphatically not the case in algorithmic information theory—then in that limited case it agrees with info!cross-entropy. Which means that the derpiness of algo!entropy is just barely not relevant to this particular post.)
So, uh, good luck.
And separately, the key points about notation (while I’m summarizing, acknowledging that the notation might not to change in your coming posts):
The corrected notion of the “complexity” of a state s given a coding scheme C is H(s,C)
It’s confusing to call this the “entropy” of s, because that sounds like H(s), which by long-established convenient convention means H(s,s).
An elegant solution is to just call this the “(C-)complexity” of s instead.
This highlights a cool reciprocal relationship between (cross-)entropy and complexity, where the s-entropy of C is the same as the C-complexity of s. This can perhaps be mined for intuition.
I like “description length”.
One wrinkle is that entropy isn’t quite minimum average description length—in general it’s a lower bound on average description length.
If you have a probability distribution that’s (2/3, 1⁄3) over two things, but you assign fixed binary strings to each of the two, then you can’t do better than 1 bit of average description length, but the entropy of the distribution is 0.92 bits.
Or if your distribution is roughly (.1135, .1135, .7729) over three things, then you can’t do better than 1.23 bits, but the entropy is 1 bit.
You can only hit the entropy exactly when the probabilities are all powers of 2.
(You can fix this a bit in the channel-coding context, where you’re encoding sequences of things and don’t have to assign fixed descriptions to individual things. In particular, you can assign descriptions to blocks of N things, which lets you get arbitrarily close as N → infinity.)
I think you can bring the two notions into harmony by allowing multiple codes per state (with the entropy/description-length of a state being the lug/nl of the fraction of the codespace that codes for that state).
For instance, you can think of a prefix-free code as a particularly well-behaved many-to-one assignment of infinite bitstrings to states, with (e.g.) the prefix-free code “0” corresponding to every infinite bitstring that starts with 0 (which is half of all infinite bitstrings, under the uniform measure).
If we consider all many-to-one assignments of infinite bitstrings to states (rather than just the special case of prefix-free codes) then there’ll always be an encoding that matches the entropy, without needing to say stuff like “well our description-length can get closer to the theoretical lower-bound as we imagine sending more and more blocks of independent data and taking the average per-block length”.
(If you want to keep the codespace finite, we can also see the entropy as the limit of how well we can do as we allow the codespace to increase in size.)
(I suspect that I can also often (always?) match the entropy if you let me design custom codespaces, where I can say stuff like “first we have a bit, and then depending on whether it’s 0 or 1, we follow it up by either a trit or a quadit”.)
(epistemic status: running off of a cache that doesn’t apply cleanly, but it smells right \shrug)
Sure, from one perspective what’s going on here is that we’re being given a distribution p and asked to come up with a distribution q such that
CrossEntropy(p, q) = E_p[-log q]
is as small as possible. And then a bit of calculus shows that q=p is optimal, with a minimal value of
Entropy(p) = CrossEntropy(p, p)
If we’re happy to call -log q “description length” right off the bat, we can let q be a distribution over the set of infinite bit strings, or the set of finite simple graphs, or over any (infinite) set we like.
But some settings are special, such as “q has to be the coin-flip distribution over a prefix-free code”, because in those settings our quantity -log q is forced to equal the length of something in the normal sense of “length of something”.
So the gap I’m interested in closing is between things that have actual lengths and things that are exactly equal to entropy, and the block encoding thing is the simplest way I know to do that.
I think using the coin-flip distribution over infinite strings is nice because it hits entropy exactly and has a clear relationship with the prefix-free-code case, but the block code motivates “length” better in isolation.
What’s your take on using “description length” for the length of a single description of a state, and “entropy” for the log-sum-exp of the description-lengths of all names for the state? (Or, well, ləg-sum-əxp, if you wanna avoid a buncha negations.)
I like it in part because the ləg-sum-əxp of all description-lengths seems to me like a better concept than K-complexity anyway. (They’ll often be similar, b/c ləg-sum-əxp is kinda softminish and the gap between description-lengths is often long, but when they differ it’s the ləg-sum-əxp’d thing that you usually want.)
For example, Solomonoff induction does not have the highest share of probability-mass on the lowest K-complexity hypothesis among those consistent with the data. It has the highest share of probability-mass on the hypothesis with lowest ləg-sum-əxp of all description-lengths among those consistent with the data.
This can matter sometimes. For instance, in physics we can’t always fix the gauge. Which means that any particular full description of physics needs to choose a full-fledged gauge, which spends an enormous amount of description-length. But this doesn’t count against physics, b/c for every possible gauge we could describe, there’s (same-length) ways of filling out the rest of the program such that it gives the right predictions. In the version of Solomonoff induction where hypotheses are deterministic programs, physics does not correspond to a short program, it corresponsd to an enormous number of long programs. With the number so enormous that the ləg-sum-əxp of all those big lengths is small.
More generally, this is related to the way that symmetry makes things simpler. If your code has a symmetry in it, that doesn’t make your program any shorter, but it does make the function/hypothesis your program represents simpler, not in terms of K-complexity but in terms of “entropy” (b/c, if S is the symmetry group, then there’s |S|-many programs of the ~same length that represent it, which decreases the ləg-sum-əxp by ~log(S)).
(Ofc, if we start thinking of hypotheses as being probabilistic programs instead, then we can unify all the symmetry-partners into a single probabilistic program that samples a gauge / element of the symmetry group. So K-complexity is often pretty close to the right concept, if you pick the right formalism. But even then I still think that the ləg-sum-əxp of all description-lengths is even more precisely what you want, e.g. when there’s multiple similar-length probabilistic programs that do the same thing.)
I’m not up to speed on the theorems that approximately relate “algorithmic entropy” to normal entropy, but I suspect that the conditions of those theorems are doing some hand-waving of the form “as the gap between description-lengths gets longer” (perhaps by iterating something that artificially inflates the gap and taking averages). If so, then I expect the ləg-sum-əxp of all description lengths to satisfy these theorems immediately and without the the handwaving.
I have now personally updated towards thinking that calling the K-complexity the “algorithmic entropy” is simply wrong. Close, but wrong. The thing deserving that name is the ləg-sum-əxp of all the description-lengths.
(Indeed, the whole concept of K-complexity seems close-but-wrong to me, and I’ve personally been using the ləg-sum-əxp concept in its place for a while now. I’ve been calling it just “complexity” in my own head, though, and hadn’t noticed that it might be worthy of the name “entropy”.)
epistemic status: spittballing; still working like 50% from cache
I still don’t like that, because this whole subthread is kind of orthogonal to my concerns about the word “entropy”.
This subthread is mostly about resolving the differences between a code (assignment of one or more codewords to one or more states) and a probability distribution. I think we’ve made progress on that and your latest comment is useful on that front.
But my concerns about “entropy” are of the form: “I notice that there’s a whole field of coding theory where ‘entropy’ means a particular function of a probability distribution, rather than a function of an individual state. This is also consistent with how physicists and other kinds of computer scientists use the word, except for the phrase ‘algorithmic entropy’. I think we should not break compatibility with this usage.”
Ignoring the differences between distributions and codes, I’d be fine with assigning “entropy” to various things shaped like either “sum of p(state) lug(p(state)) across all states” for a distribution or “sum of (2^-len(state)) len(state) across all states” for a code.
I am also fine with assigning it to “sum of p(state) len(state)” for a (distribution, code) pair that are matched in an appropriate sense—the distribution is the coinflip distribution for the code, or the code is optimal for the distribution, or something else roughly equivalent.
Elsewhere Alex and I have been referring to this as a pair of qualifiers “average” (i.e. it’s a sum over all states weighted by p or 2^-len) and “minimal” (i.e. the two factors in the sum are for matching or identical codes/distributions).
“Average” distinguishes entropy from the things information theorists call “length” or “self-information” or “surprisal” or just “[negative] log-prob”, and “minimal” distinguishes entropy from “expected length [of an arbitrary code]” or “cross-entropy”.
Cool thanks. I’m hearing you as saying “I want to reserve ‘entropy’ for the case where we’re weighting the length-like thingies by probability-like thingies”, which seems reasonable to me.
I’m not sure I follow the part about matched (distribution, code) pairs. To check my understanding: for a sufficiently forgiving notion of “matching”, this is basically going to yield the cross-entropy, right? Where, IIUC, we’ve lifted the code to a distribution in some natural way (essentially using a uniform distribution, though there might be a translation step like translating prefix-free codes to sets of infinite bitstrings), and then once we have two distributions we take the cross-entropy.
(One of my hypotheses for what you’re saying is “when the distribution and the code are both clear from context, we can shorten ‘cross-entropy’ to ‘entropy’. Which, ftr, seems reasonable to me.)
My own proclivities would say: if I specify only a state and a code, then the state lifts to a distribution by Kronecker’s delta and the code lifts to a distribution uniformly, and I arbitrarily declare these to ‘match’, and so when we speak of the (cross-)entropy of a state given a code we mean the length of the code(s) for that particular state (combined by ləg-sum-əxp if there’s multiple).
This seems like the natural way to ‘match’ a state and a code, to my eye. But I acknowledge that what counts as ‘matching’ is a matter of intuition and convention, and that others’ may differ from mine.
At this point, though, the outcome I’m most invested in is emerging with a short name for “the ləg-sum-əxp of the lengths of all the descriptions”. I’m fine with naming it some variation on ‘complexity’, though. (Komolgorov kindly left a K in K-complexity, so there’s ample room to pick another letter if we have to.)
(Though to be very explicit about my personal preferences, I’d use “entropy”. It seems to me that once we’ve conceded that we can talk about the entropy of a (distribution, code) pair then we might as well extend the notation to include (state, code) pairs; they’re sibling shorthands. Or, to say it more trollishly: the Nate!complexity of a function is a length-like thingy weighted by a probability-like thingy, where the probability-like thingy is 1 :-p.)
I want something much more demanding—I want the distribution and code to be “the same” (related by p = 2^-len), or something “as close as possible” to that.
I was leaving a little bit of wiggle room to possibly include “a code matches a distribution if it is the optimal code of its type for compression under that source distribution”, but this is only supposed to allow rounding errors; it seems sort of okay to say that the expected length of (0, 10, 11) under the distribution (0.4, 0.3, 0.3) is some (not quite standard) sort of entropy for that distribution, but not okay to say that the expected length of (0, 10, 11) under (0., 0., 1.) is an entropy.
But I’m on the fence about even giving that much wiggle room.
That’s the only reason I exclude single states. I agree that the length of a state is a kind of cross-entropy, because you can choose a delta distribution, but I draw a firm line between cross-entropy and entropy.
(Obviously there’s a special case, where a code that has a single empty codeword for a single state matches a delta distribution. But not if the codeword isn’t the empty string.)
I wonder if it would be reasonable to use “xentropy” for the broad sense of “entropy” in OP, with the understanding that xentropy is always a two-argument function.
“The length of a codeword is the xentropy between [the delta distribution located at] the state and [the coinflip distribution implied by] the code”
Cool cool. I can personally see the appeal of reserving ‘entropy’ for the case where the distribution and the (natural lifting of) the code (to a distribution) are identical, i.e. your proposal without the wiggle-room.
I don’t yet personally see a boundary between the wiggle-room you’re considering and full-on “we can say ‘entropy’ as a shorthand for ‘cross-entropy’ when the second distribution is clear from context” proposal.
In particular, I currently suspect that there’s enough wiggle-room in “optimal code of its type for compression under the source distribution” to drive a truck through. Like, if we start out with a uniform code C and a state s, why not say that the “type of codes” for the source distribution δ(s) is the powerset of {c ∈ C | c codes for s}? In which case the “optimal code for compression” is the set of all such c, and the ‘entropy’ is the Nate!complexity?
I’m not yet sure whether our different aesthetics here are due to:
me failing to see a natural boundary that you’re pointing to
you not having yet seen how slippery the slope is
you having a higher tolerance for saying “humans sometimes just wanna put fences halfway down the slippery slope, dude”.
Insofar as you think I’m making the mistake of (1), I’m interested to hear arguments. My argument above is ofc tuned to case (2), and it’s plausible to me that it pushes you off the fence towards “no wiggle room”.
Another place we might asethetically differ is that I’m much happier blurring the line between entropy and cross-entropy.
One handwavy argument for blurring the line (which has the epistemic status: regurgitating from a related cache that doesn’t cleanly apply) is that if the statespace is uncountably infinite then we need a measure in order to talk about entropy (and make everything work out nicely under change-of-variables). And so in the general case, entropy is already a two-place
predicatefunction involving a distribution and some sort of measure. (...Although my cache is unclear on whether this generalization yields the KL divergence or the cross-entropy. Probably KL. But, like, if we’re holding the correct distribution fixed, then they differ only by a constant. Hopefully. This is as far as I can go w/out warming up that cache, and I don’t wanna warm it up right now \shrug.)Yup, I think I am happy to abandon the wiggle room at this point, for this reason.
I think my preferred approach to this is that the density p(x) is not really the fundamental object, and should be thought of as dP/dmu(x), with the measure in the denominator. We multiply by dmu(x) in the integral for entropy in order to remove this dependence on mu that we accidentally introduced.EDIT: this is flagrantly wrong because log(p) depends on the measure also. You’re right that this is really a function of the distribution and the measure; I’m not sure offhand if it’s crossentropy, either, but I’m going to think about this more. (This is an embarrassing mistake because I already knew differential entropy was cursed with dependence on a measure—quantum mechanics famously provides the measure on phase-space that classical statistical mechanics took as axiomatic.)For what it’s worth, I’ve heard the take “entropy and differential entropy are different sorts of things” several times; I might be coming around to that, now that I see another slippery slope on the horizon.
I’d be interested in a citation of what you’re referring to here!
The state-space (for particles) in statmech is the space of possible positions and momenta for all particles.
The measure that’s used is uniform over each coordinate of position and momentum, for each particle.
This is pretty obvious and natural, but not forced on us, and:
You get different, incorrect predictions about thermodynamics (!) if you use a different measure.
The level of coarse graining is unknown, so every quantity of entropy has an extra “+ log(# microstates per unit measure)” which is an unknown additive constant. (I think this is separate from the relationship between bits and J/K, which is a multiplicative constant for entropy—k_B—and doesn’t rely on QM afaik.)
On the other hand, Liouville’s theorem gives some pretty strong justification for using this measure, alleviating (1) somewhat:
https://en.wikipedia.org/wiki/Liouville%27s_theorem_(Hamiltonian)
In quantum mechanics, you have discrete energy eigenstates (...in a bound system, there are technicalities here...) and you can define a microstate to be an energy eigenstate, which lets you just count things and not worry about measure. This solves both problems:
Counting microstates and taking the classical limit gives the “dx dp” (aka “dq dp”) measure, ruling out any other measure.
It tells you how big your microstates are in phase space (the answer is related to Planck’s constant, which you’ll note has units of position * momentum).
This section mostly talks about the question of coarse-graining, but you can see that “dx dp” is sort of put in by hand in the classical version: https://en.wikipedia.org/wiki/Entropy_(statistical_thermodynamics)#Counting_of_microstates
I wish I had a better citation but I’m not sure I do.
In general it seems like (2) is talked about more in the literature, even though I think (1) is more interesting. This could be because Liouville’s theorem provides enough justification for most people’s tastes.
Finally, knowing “how big your microstates are” is what tells you where quantum effects kick in. (Or vice versa—Planck estimated the value of the Planck constant by adjusting the spacing of his quantized energy levels until his predictions for blackbody radiation matched the data.)
:D
(strong upvote for shifting position in realtime, even if by small degrees and towards the opposite side of the fence from me :-p)
(I’m not actually familiar enough w/ statmech to know what measure we use on phase-space, and I’m interested in links that explain what it is, and how it falls out of QM, if you have any handy :D)
I don’t currently have much sympathy for “entropy and differential entropy are different” view, b/c I occasionally have use for non-uniform measures even in the finite case.
Like, maybe I’m working with distributions over 10 states, and I have certain constraints I’m trying to hit, and subject to those constraints I wanna maximize entropy wrt the Binomial distribution as the measure. And you might be like “Well, stop doing that, and start working with distributions on 2^10 states, and convert all your constraints to refer to the bitcount of the new state (instead of the value of the old state). Now you can meet your constraints while maximizing entropy wrt the uniform measure like a normal person.” To which I’d reply (somewhat trollishly) “that rejoinder doesn’t work in the limit, so it must get weaker and weaker as I work with more and more states, and at numbers as large as 10 this rejoinder is so weak as to not compel me.”
From my perspective, the obvious rejoinder to “entropy is already two-place” is “insofar as entropy is two-place, cross-entropy is three-place!”. Which, ftr, I might find compelling. It depends whether differential cross-entropy needs all three parameters (P, Q, μ), or whether we can combine two of them (like by using (P, μP/Q) or something). Or, at least, that’s what my intuition says off-the-cuff; I’m running on guesswork and a stale cache here :-p.
I think this is roughly where I’m at now.
After thinking a bit and peeking at Wikipedia, the situation seems to be:
The differential entropy of a probability density p is usually defined as
−∫p(x)log(p(x))dx
This is unfortunate, because it isn’t invariant under coordinate transformations on x. A more principled (e.g. invariant) thing to write down, courtesy of Jaynes, is
−∫p(x)log(p(x)m(x))dx
where m=dμ/dx is a density function for some measure μ. We can also write this as
−∫log(dPdμ)dP (Jaynes’ continuous entropy of P with respect to μ)
in terms of a probability measure P with p=dP/dx, which is a bit more clearly invariant.
Now we can define a cross-entropy-like thing as
−∫log(dQdμ)dP (continuous cross-entropy of Q under P with respect to μ)
...and a small surprise is coming up. Jumping back to the discrete case, the KL divergence or “relative entropy” is
D(P||Q)=∑xP(X)log(P(x)Q(x))=CrossEntropy(P,Q)−Entropy(P)
What happens when we try to write something analogous with our new continuous entropy and crossentropy? We get
−∫log(dQdμ)dP+∫log(dPdμ)dP=∫log(dPdQ)dP
which looks like the right generalization of discrete KL divergence, but also happens to be the continuous entropy of P with respect to Q! (The dependence on μ drops out.) This is weird but I think it might be a red herring.
We can recover the usual definitions in the discrete (finite or infinite) case by taking μ to be a uniform measure that assigns 1 to each state (note that this is NOT a probability measure—I never said μ was a probability measure, and I don’t know how to handle the countably infinite case if it is.)
So maybe the nicest single thing to define, for the sake of making everything else a concisely-specified special case, is
H(P,Q;μ)=EP[−log(dQdμ)]=−∫log(dQdμ)dP
(“H” chosen because it is used for both entropy and cross-entropy, afaik unlike “S”).
We could take KL divergence as a two-argument atomic thing, or work with −∫log(f)dP for a scalar function f and a distribution P, but I think these both make it cumbersome to talk about the things everyone wants to talk about all the time. I’m open to those possibilities, though.
The weird observation about KL divergence above is a consequence of the identity
H(P,Q;μ)−H(P,R;μ)=H(P,Q;R)
but I think this is a slightly strange fact because R is a probability measure and μ is a general measure. Hmm.
We also have
H(P,Q;μ)=−H(P,μ;Q)
in the case that μ is a probability measure, but not otherwise.
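(Quick numeric spot-check of those two identities, with the same little discrete H sketch redefined so it runs standalone; the random measures are arbitrary:)

```python
# Numeric spot-check of the two identities above (illustrative code only).
import numpy as np

def H(P, Q, mu):
    return -np.sum(np.asarray(P) * np.log2(np.asarray(Q) / np.asarray(mu)))

rng = np.random.default_rng(0)
P, Q, R = (rng.random(5) for _ in range(3))
P, Q, R = P / P.sum(), Q / Q.sum(), R / R.sum()   # three probability measures
mu = rng.random(5) * 3                            # a general (non-probability) measure

# H(P, Q; mu) - H(P, R; mu) = H(P, Q; R): mu drops out
print(H(P, Q, mu) - H(P, R, mu), H(P, Q, R))

# H(P, Q; R) = -H(P, R; Q): fine here, since R (in the second slot) is a probability measure
print(H(P, Q, R), -H(P, R, Q))
```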
And I can rattle off some claims about special cases in different contexts (tied together in a small numeric sketch after this list):
A (the?) fundamental theorem of entropy-like stuff is H(P,Q;μ)≥H(P,P;μ), or “crossentropy >= entropy”.
In the discrete case, practitioners of information theory and statistical physics tend to assume μ is uniform with measure 1 on each state, and (almost?) never discuss it directly. I’ll call this measure U.
They reserve the phrase “entropy of P” (in the discrete case) for H(P, P; U), and this turns out to be an extremely useful quantity.
They use “cross-entropy of Q relative to P” to mean H(P, Q; U)
They use “KL divergence from Q to P” (or “relative entropy …”) to mean H(P, Q; P) = -H(P, P; Q) = H(P, Q; U) - H(P, P; U)
For a state s, let’s define delta_s to be the distribution that puts measure 1 on s and 0 everywhere else.
For a code C, let’s define f_C to be the “coin flip” distribution that puts probability 2^-len(C(s)) on state s (or the sum of this over all codewords for s, if there are several). (Problem: what if C doesn’t assign all its available strings, so that this isn’t a probability distribution? idk.)
(see also https://en.wikipedia.org/wiki/Cross_entropy#Motivation)
If a state s has only one codeword C(s), the length of that codeword is len(C(s)) = H(delta_s, f_C; U)
Note that this is not H(delta_s, delta_s; U), the entropy of delta_s, which is always zero. This is a lot of why I don’t like taking “entropy of s” to mean H(delta_s, f_C; U).
Information theorists use “surprisal of s under P” (or “self-information …”) to mean H(delta_s, P; U)
The expected length of a code C, for a source distribution P, is H(P, f_C; U)
Consequence from fundamental thm: a code is optimal iff f_C = P; the minimal expected length of the code is the entropy of P
If you use the code C when you should’ve used D, your expected length is H(f_D, f_C; U) and you have foolishly wasted H(f_D, f_C; f_D) bits in expectation.
Physicists use “entropy of a macrostate” to mean things like H(p_A, p_A; U) where p_A is a uniform probability measure over a subset A of the state space, and possibly also for other things of the form H(P, P; U) where P is some probability measure that corresponds to your beliefs if you only know macroscopic things.
Kolmogorov complexity (“algorithmic entropy”) of a state s is, as you argue, maybe best seen as an approximation of H(delta_s, p_Solomonoff; U). I don’t see a good way of justifying the name by analogy to any “entropy” in any other context, but maybe I’m missing something.
This makes that one quote from Wiktionary into something of the form: “H(P, p_Solomonoff; U) ~= H(P, P; U) whenever P is computed by a short program”. This makes sense for some sense of “~=” that I haven’t thought through.
When working in continuous spaces like R, people typically define “[differential] entropy of P” as H(P, P; L) where L is the Lebesgue measure (i.e., the uniform measure on R). If they are working with coordinate transformations from R to R, they implicitly treat one variable as the “true” one which the measure should be uniform on, often without saying so.
They define cross-entropy as H(P, Q; L)....
...and KL divergence as H(P, Q; L) - H(P, P; L) = H(P, Q; P) (same as discrete case!)
But sometimes people use Jaynes’ definitions, which replace L with an arbitrary (and explicitly-specified) measure, like we’re doing here. In that case, “entropy” is H(P, P; mu) for an arbitrary mu.
The properness of the logarithmic scoring rule also follows from the fundamental theorem. Your (subjective) expected score is H(your_beliefs, your_stated_beliefs; U) and this is minimized for your_stated_beliefs = your_beliefs.
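(The promised sketch, tying several of those special cases back to the single H(P, Q; μ) form. The 4-state source and the code C are made up for illustration, and the code is mine, not anyone’s official notation:)

```python
# Numeric sketch relating the special cases above to H(P, Q; mu).
import numpy as np

def H(P, Q, mu):
    # Same discrete sketch as before, with the usual 0 * log(0) = 0 convention.
    P, Q, mu = map(np.asarray, (P, Q, mu))
    nz = P > 0
    return -np.sum(P[nz] * np.log2(Q[nz] / mu[nz]))

def delta(s, n):
    d = np.zeros(n)
    d[s] = 1.0
    return d

U = np.ones(4)                              # uniform measure, 1 per state

P = np.array([0.5, 0.25, 0.125, 0.125])     # source distribution over 4 states

# A prefix-free code C with codeword lengths 1, 2, 3, 3 (say 0, 10, 110, 111),
# and its "coin flip" distribution f_C(s) = 2^-len(C(s)).
lengths = np.array([1, 2, 3, 3])
f_C = 2.0 ** -lengths

print(H(delta(2, 4), f_C, U))   # codeword length of state 2: len(C(2)) = 3
print(H(delta(2, 4), P, U))     # surprisal of state 2 under P: -log2(0.125) = 3
print(H(P, f_C, U))             # expected code length of C under source P: 1.75
print(H(P, P, U))               # entropy of P: 1.75 (here f_C = P, so C is optimal)

# Physics-style entropy of a macrostate A = {0, 1}: H(p_A, p_A; U) = log2 |A| = 1
p_A = np.array([0.5, 0.5, 0.0, 0.0])
print(H(p_A, p_A, U))
```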
I wanted to let that comment be about the interesting question of how we unify these various things.
But on the ongoing topic of “why not call all this entropy, if it’s all clearly part of the same pattern?”:
When the definition of some F(x) refers to x twice, it’s often useful to replace one of them with y and call that G(x, y). But it’s usually not good for communication to choose a name for G(x, y) that (almost) everyone else uses exclusively for F(x), especially if you aren’t going to mention both x and y every time you use it, and doubly especially if G is already popular enough to have lots of names of its own (you might hate those names, but get your own[1]).
e.g.: x*y is not “the square of x and y” much less “the square of x [and y is implied from context]”, and the dot product v.w is not “the norm-squared of v and w” etc.
[1] might I suggest “xentropy”?
:D
(strong upvote for delicious technical content)
(also fyi, the markdown syntax for footnotes is like
blah blah blah[^1]
and then somewhere else in the file, on a new line, [^1]: content of the footnote
)
This updates me a fair bit towards the view that we should keep entropy and cross-entropy separate.
The remaining crux for me is whether the info theory folks are already using “entropy” to mean “the minimum expected surprise you can achieve by choosing a code from this here space of preferred codes” (as per your “wiggle room” above), in which case my inclination is to instead cackle madly while racing my truck through their wiggle-room, b/c that’s clearly a cross-entropy.
At a glance through wikipedia, it doesn’t look like they’re doing that, though, so I retreat.
But, hmm, I’m not sure I retreat all the way to having separate words.
I’m persuaded that “the square of x” should absolutely not mean “x * y [where y is implied from context]”, and I no longer think that “entropy” should mean the cross-entropy with Q implied from context. (Thanks for the distillation of why it’s mad!)
But, like, if geometers instead invented a two-place operation rect, with a general form rect(x, y) := x * y, and declared that rect(x) was shorthand for rect(x, x), then I would not protest; this seems like a reasonable and unambiguous way to reduce the number of names floating around.[1] And this seems to me like exactly what the information theorists (by contrast to the physicists) are doing with H (by contrast to S)!
Like, the notations H(P) and H(P,Q) are just begging for us to pronounce H the same way in each case; we’re not tempted to pronounce the former as “eych” and the latter as “cross-eych”. And no ambiguity arises, so long as we use the rule that if Q is left out then we mean P.
Thus, I propose the following aggressive nomenclature:
the “P-entropy of Q (wrt μ)” aka Hμ(P,Q) is the general form
the “entropy of P (wrt μ)” aka Hμ(P) is a shorthand for “the P-entropy of P (wrt μ)”
μ may be omitted when the canonical uniform measure is clear from context
With the following bold extension:
the “Q-complexity of P (wrt μ)” aka Kμ(Q,P) is another name for Hμ(P,Q)
the “complexity of P (wrt μ)” aka Kμ(P) is used when Q can be inferred from context
we cast codes/UTMs/points/sets to distributions in natural ways
Thus, when we have a canonical universal Turing machine U on hand, “the complexity of f” aka K(f) means “the Solomonoff(U)-complexity of δ(f)” aka “the δ(f)-entropy of Solomonoff(U)”.
(I’m not wedded to the word “complexity” or the symbol “K” there. What I want is terms and symbols that let me quickly say something that means the ləg-sum-əxp of the code-lengths of all codes for f in the implicit coding scheme.)
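(For concreteness, a toy sketch of the quantity I mean; the function names are mine and the coding scheme is just a list of codeword lengths standing in for a real UTM:)

```python
# Toy sketch (mine): instead of the length of the shortest code for f, take
# -log2 of the total coin-flip mass of *all* codes for f.
import numpy as np

def k_complexity(code_lengths):
    """Length of the shortest code: the usual K-complexity-style quantity."""
    return min(code_lengths)

def complexity(code_lengths):
    """-log2 sum_i 2^-len_i over all codes for the state (the log-sum-exp version)."""
    return -np.log2(np.sum(2.0 ** -np.asarray(code_lengths)))

# A state with five 3-bit codes vs. a state with one 2-bit code:
print(k_complexity([3, 3, 3, 3, 3]), complexity([3, 3, 3, 3, 3]))  # 3 vs ~0.68
print(k_complexity([2]), complexity([2]))                          # 2 vs 2.0
```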
(One reason I like the pronunciation “P-entropy of Q” is that saying “P-entropy” feels intuitively like we’re obviously weighting by P (perhaps b/c of the analogy to “P-expectation”), and so I expect it to help me remember which distribution is which in H(P, Q).)
As a fallback, I think ‘xentropy’ is decent, but my guess is that ‘crossent’ is better. It has fewer syllables, it’s already common, and it’s more obvious what it means.
(If I was using ‘crossent’ a lot I might start using ‘selfent’ to mean the entropy, b/c of the symmetry and the saved syllable. Three syllables for ‘entropy’ is kinda a lot!)
That said, I currently personally prefer using the same word for both in this case, and note that it’s very heavily precedented by info theory’s H.
Musing aloud for my own sake: it’s not in general bad to say “the general form is F(x,y), but when we say F(x) that means to infer y from context”. We do that even here, with the μ in entropy! The thing that’s bad is when F(x,x) is a very common case. In particular, the rule “F(x) means F(x, figure_it_out)” and the rule “F(x) means F(x, x)” are in conflict, and you can only have one of those active at a time. And in the case of entropy, the H(P,P) rule has a much, much stronger claim to the throne. So we concede that H(P) means H(P,P) and not H(P, figure_it_out), but without conceding that you can’t invent other notation that says Q should be inferred from context.
I agree with all the claims in this comment and I rather like your naming suggestions! Especially the “P-entropy of Q = Q-complexity of P” trick which seems to handle many use cases nicely.
(So the word “entropy” wasn’t really my crux? Maybe not!)
Convergence! 🎉🎉. Thanks for the discussion; I think we wound up somewhere cooler than I would have gotten to on my own.
Now we just need to convince Alex :-p. Alex, are you still reading? I’m curious for your takes on our glorious proposal.
Heh, I’m still skimming enough to catch this, but definitely not evaluating arguments.
I’m definitely still open to both changing my mind about the best use of terms and also updating the terminology in the sequence (although I suspect that will be quite a non-trivial amount of modified prose). And I think it’s best if I don’t actually think about it until after I publish another post.
I’d also be much more inclined to think harder about this discussion if there were more than two people involved.
My main goal here has always been “clearly explain the existing content so that people can understand it”, which is very different from “propose a unification of the whole field” (it’s just that “unification” is my native method of understanding).
Cool cool. A summary of the claims that feel most important to me (for your convenience, and b/c I’m having fun):
K-complexity / “algorithmic entropy” is a bad concept that doesn’t cleanly relate to physics!entropy or info!entropy.
In particular, the K-complexity of a state s is just the length of the shortest code for s, and this is bad because when s has multiple codes it should count as “simpler”. (A state with five 3-bit codes is simpler than a state with one 2-bit code: −log2(5 · 2⁻³) = 3 − log2 5 ≈ 0.68 bits, vs 2 bits.) (Which is why symmetry makes a concept simpler despite not making its code shorter.)
If we correct our notion of “complexity” to take multiple codes into account, then we find that complexity of a state s (with respect to a coding scheme C) is just the info!cross-entropy H(s,C). Yay!
Separately, some gripes:
the algorithmic information theory concept is knuckleheaded, and only approximates info!entropy if you squint really hard, and I’m annoyed about it
I suspect that a bunch of the annoying theorems in algorithmic information theory are annoying precisely because of all the squinting you have to do to pretend that K-complexity was a good idea
And some pedagogical notes:
I’m all for descriptive accounts of who uses “entropy” for what, but it’s kinda a subtle situation because:
info!entropy is a very general concept,
physics!entropy is an interesting special case of that concept (in the case where the state is a particular breed of physical macrostate),
algo!entropy is a derpy mistake that’s sniffing glue in the corner,
algo!entropy is sitting right next to a heretofore unnamed concept that is another interesting special case of info!(cross-)entropy (in the case where the code is universal).
(oh and there’s a bonus subtlety that if you port algo!entropy to a situation where the coding schema has at most one code per state—which is emphatically not the case in algorithmic information theory—then in that limited case it agrees with info!cross-entropy. Which means that the derpiness of algo!entropy is just barely not relevant to this particular post.)
So, uh, good luck.
And separately, the key points about notation (while I’m summarizing, acknowledging that the notation might not change in your coming posts):
The corrected notion of the “complexity” of a state s given a coding scheme C is H(s,C)
It’s confusing to call this the “entropy” of s, because that sounds like H(s), which by long-established convenient convention means H(s,s).
An elegant solution is to just call this the “(C-)complexity” of s instead.
This highlights a cool reciprocal relationship between (cross-)entropy and complexity, where the s-entropy of C is the same as the C-complexity of s. This can perhaps be mined for intuition.
Endorsed.