I wanted to let that comment be about the interesting question of how we unify these various things.
But on the ongoing topic of “why not call all this entropy, if it’s all clearly part of the same pattern?”:
When the definition of some F(x) refers to x twice, it’s often useful to replace one of them with y and call that G(x, y). But it’s usually not good for communication to choose a name for G(x, y) that (almost) everyone else uses exclusively for F(x), especially if you aren’t going to mention both x and y every time you use it, and doubly especially if G is already popular enough to have lots of names of its own (you might hate those names, but get your own[1]).
e.g.: x*y is not “the square of x and y” much less “the square of x [and y is implied from context]”, and the dot product v.w is not “the norm-squared of v and w” etc.
[1] might I suggest “xentropy”?
:D
(strong upvote for delicious technical content)
(also fyi, the markdown syntax for footnotes is like `blah blah blah[^1]` and then somewhere else in the file, on a newline, `[^1]: content of the footnote`)

This updates me a fair bit towards the view that we should keep entropy and cross-entropy separate.
The remaining crux for me is whether the info theory folks are already using “entropy” to mean “the minimum expected surprise you can achieve by choosing a code from this here space of preferred codes” (as per your “wiggle room” above), in which case my inclination is to instead cackle madly while racing my truck through their wiggle-room, b/c that’s clearly a cross-entropy.
At a glance through wikipedia, it doesn’t look like they’re doing that, though, so I retreat.
But, hmm, I’m not sure I retreat all the way to having separate words.
I’m persuaded that “the square of x” should absolutely not mean “x * y [where y is implied from context]”, and I no longer think that “entropy” should mean the cross-entropy with Q implied from context. (Thanks for the distillation of why it’s mad!)
But, like, if geometers instead invented a two-place operation `rect`, with a general form `rect(x, y) := x * y`, and declared that `rect(x)` was shorthand for `rect(x, x)`, then I would not protest; this seems like a reasonable and unambiguous way to reduce the number of names floating around.[1]

And this seems to me like exactly what the information theorists (by contrast to the physicists) are doing with H (by contrast to S)!
Like, the notations H(P) and H(P,Q) are just begging for us to pronounce H the same way in each case; we’re not tempted to pronounce the former as “eych” and the latter as “cross-eych”. And no ambiguity arises, so long as we use the rule that if Q is left out then we mean P.
Thus, I propose the following aggressive nomenclature:
the “P-entropy of Q (wrt μ)” aka Hμ(P,Q) is the general form
the “entropy of P (wrt μ)” aka Hμ(P) is a shorthand for “the P-entropy of P (wrt μ)”
μ may be omitted when the canonical uniform measure is clear from context
With the following bold extension:
the “Q-complexity of P (wrt μ)” aka Kμ(Q,P) is another name for Hμ(P,Q)
the “complexity of P (wrt μ)” aka Kμ(P) is used when Q can be inferred from context
we cast codes/UTMs/points/sets to distributions in natural ways
Thus, when we have a canonical universal Turing machine U on hand, “the complexity of f” aka K(f) means “the Solomonoff(U)-complexity of δ(f)” aka “the δ(f)-entropy of Solomonoff(U)”.
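Spelling out that chain of definitions explicitly (a sketch; this takes Solomonoff(U) to assign f the usual total 2^(-length) weight of the programs that output f, and glosses over the fact that those weights only form a semimeasure):

$$K(f) \;=\; H\big(\delta(f), \mathrm{Solomonoff}(U)\big) \;=\; -\log_2 \sum_{p \,:\, U(p)=f} 2^{-|p|}$$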
(I’m not wedded to the word “complexity” or the symbol “K” there. What I want is terms and symbols that let me quickly say something that means the log-sum-exp of the code-lengths of all codes for f in the implicit coding scheme.)
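A minimal sketch of what the proposed nomenclature could look like as code (Python; the names `H`, `K`, and the coin example are purely illustrative, and μ is omitted, i.e. taken to be the uniform/counting measure):

```python
import math

def H(P, Q=None):
    """P-entropy of Q: the expected surprise under Q, weighted by P.
    If Q is omitted it defaults to P, so H(P) is the ordinary entropy H(P, P)."""
    if Q is None:
        Q = P
    return -sum(p * math.log2(Q[x]) for x, p in P.items() if p > 0)

def K(Q, P):
    """Q-complexity of P: just another name for the P-entropy of Q."""
    return H(P, Q)

fair = {"heads": 0.5, "tails": 0.5}
biased = {"heads": 0.25, "tails": 0.75}
print(H(fair))          # 1.0 bit: the entropy of a fair coin
print(H(fair, biased))  # ~1.21 bits: the fair-entropy of the biased coin
print(K(biased, fair))  # same ~1.21 bits, read as "the biased-complexity of fair"
```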
(One reason I like the pronunciation “P-entropy of Q” is that saying “P-entropy” feels intuitively like we’re obviously weighting by P (perhaps b/c of the analogy to “P-expectation”), and so I expect it to help me remember which distribution is which in H(P, Q).)
As a fallback, I think ‘xentropy’ is decent, but my guess is that ‘crossent’ is better. It has fewer syllables, it’s already common, and it’s more obvious what it means.
(If I was using ‘crossent’ a lot I might start using ‘selfent’ to mean the entropy, b/c of the symmetry and the saved syllable. Three syllables for ‘entropy’ is kinda a lot!)
That said, I currently personally prefer using the same word for both in this case, and note that it’s very heavily precedented by info theory’s H.
Musing aloud for my own sake: it’s not in general bad to say “the general form is F(x,y), but when we say F(x) that means to infer y from context”. We do that even here, with the μ in entropy! The thing that’s bad is when F(x,x) is a very common case. In particular, the rule “F(x) means F(x,figure_it_out)” and the rule “F(x) means F(x,x)” are in conflict, and you can only have one of those active at a time. And in the case of entropy, the H(P,P) rule has a much, much stronger claim to the throne. So we concede that H(P) means H(P,P) and not H(P,figure_it_out), but without conceding that you can’t invent other notation that says Q should be inferred from context.
I agree with all the claims in this comment and I rather like your naming suggestions! Especially the “P-entropy of Q = Q-complexity of P” trick which seems to handle many use cases nicely.
(So the word “entropy” wasn’t really my crux? Maybe not!)
Convergence! 🎉🎉. Thanks for the discussion; I think we wound up somewhere cooler than I would have gotten to on my own.
Now we just need to convince Alex :-p. Alex, are you still reading? I’m curious for your takes on our glorious proposal.
Heh, I’m still skimming enough to catch this, but definitely not evaluating arguments.
I’m definitely still open to both changing my mind about the best use of terms and also updating the terminology in the sequence (although I suspect that will be quite a non-trivial amount of modified prose). And I think it’s best if I don’t actually think about it until after I publish another post.
I’d also be much more inclined to think harder about this discussion if there were more than two people involved.
My main goal here has always been “clearly explain the existing content so that people can understand it”, which is very different from “propose a unification of the whole field” (it’s just that “unification” is my native method of understanding).
Cool cool. A summary of the claims that feel most important to me (for your convenience, and b/c I’m having fun):
K-complexity / “algorithmic entropy” is a bad concept that doesn’t cleanly relate to physics!entropy or info!entropy.
In particular, the K-complexity of a state s is just the length of the shortest code for s, and this is bad because a state with multiple codes should count as “simpler”. (A state with five 3-bit codes is simpler than a state with one 2-bit code.) (Which is why symmetry makes a concept simpler despite not making its code shorter.)
If we correct our notion of “complexity” to take multiple codes into account, then we find that the complexity of a state s (with respect to a coding scheme C) is just the info!cross-entropy H(s,C). Yay!
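A quick numerical check of that parenthetical, as a sketch (assuming the standard 2^(-length) weight per code, so that extra codes add to the state’s total weight):

```python
import math

def k_complexity(code_lengths):
    """K-complexity-style answer: just the shortest code length."""
    return min(code_lengths)

def corrected_complexity(code_lengths):
    """Corrected answer: -log2 of the total weight 2^-len over all codes,
    so having more codes makes the state count as simpler."""
    return -math.log2(sum(2.0 ** -length for length in code_lengths))

print(k_complexity([3, 3, 3, 3, 3]), corrected_complexity([3, 3, 3, 3, 3]))  # 3, ~0.68 bits
print(k_complexity([2]), corrected_complexity([2]))                          # 2, 2.0 bits
```

So the five-3-bit-code state comes out at roughly 0.68 bits under the corrected notion, beating the single-2-bit-code state at 2 bits, even though K-complexity ranks them the other way around.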
Separately, some gripes:
the algorithmic information theory concept is knuckleheaded, and only approximates info!entropy if you squint really hard, and I’m annoyed about it
I suspect that a bunch of the annoying theorems in algorithmic information theory are annoying precisely because of all the squinting you have to do to pretend that K-complexity was a good idea
And some pedagogical notes:
I’m all for descriptive accounts of who uses “entropy” for what, but it’s kinda a subtle situation because:
info!entropy is a very general concept,
physics!entropy is an interesting special case of that concept (in the case where the state is a particular breed of physical macrostate),
algo!entropy is a derpy mistake that’s sniffing glue in the corner,
algo!entropy is sitting right next to a heretofore unnamed concept that is another interesting special case of info!(cross-)entropy (in the case where the code is universal).
(oh and there’s a bonus subtlety that if you port algo!entropy to a situation where the coding scheme has at most one code per state—which is emphatically not the case in algorithmic information theory—then in that limited case it agrees with info!cross-entropy. Which means that the derpiness of algo!entropy is just barely not relevant to this particular post.)
So, uh, good luck.
And separately, the key points about notation (while I’m summarizing, acknowledging that the notation might not change in your coming posts):
The corrected notion of the “complexity” of a state s given a coding scheme C is H(s,C)
It’s confusing to call this the “entropy” of s, because that sounds like H(s), which by long-established convenient convention means H(s,s).
An elegant solution is to just call this the “(C-)complexity” of s instead.
This highlights a cool reciprocal relationship between (cross-)entropy and complexity, where the s-entropy of C is the same as the C-complexity of s. This can perhaps be mined for intuition.
Endorsed.